Top Twelve Tip #11

Nondetects: never substitute. Mine the information in the proportions

The problem with substitution is what I’ve come to call “invasive data”. Substitution is not neutral, but invasive: a pattern is added to the data that may be quite different from the pattern of the data itself. It can take over and choke out the native pattern. Consider the left plot below, a straight-line relationship between two variables, Concentration (y) versus Distance (x) downstream. The slope of the relationship is significant, with a strong positive correlation between the variables. Concentrations are increasing (perhaps with increasing urbanization) downstream. What happens when the data are reported using two detection limits of 1 and 3, and one-half the limit is substituted for the censored observations? The result (plot on right) includes horizontal lines of substituted values, changing the slope and dramatically decreasing the correlation coefficient between the variables. Looking only at these numbers, the data analyst obtains the (wrong) impression that there is no correlation and no increase in concentration. Once an invasive flat line is added to the original data, the original relationship may easily be missed.

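[Figure: Concentration vs Distance downstream. Left: original data. Right: after substituting one-half the detection limit for censored observations.]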
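The effect described above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the twelve concentrations are made-up illustrative values (not the data behind the plots), and the rule that samples alternate between detection limits of 1 and 3 is likewise an assumption chosen for the example.

```python
from math import sqrt

def slope_and_r(x, y):
    """Least-squares slope and Pearson correlation of y against x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / sqrt(sxx * syy)

# Hypothetical data: concentration rising steadily with distance downstream.
distance = list(range(1, 13))
conc = [0.4, 0.5, 0.9, 1.1, 1.3, 1.7, 1.8, 2.2, 2.4, 2.6, 3.1, 3.2]

# Assumption: samples alternate between detection limits of 1 and 3.
limits = [1 if i % 2 == 0 else 3 for i in range(len(conc))]

# Substitute one-half the limit for every value below its detection limit.
substituted = [c if c >= dl else dl / 2 for c, dl in zip(conc, limits)]

slope_true, r_true = slope_and_r(distance, conc)
slope_sub, r_sub = slope_and_r(distance, substituted)

print(f"original:    slope = {slope_true:.3f}, r = {r_true:.3f}")
print(f"substituted: slope = {slope_sub:.3f}, r = {r_sub:.3f}")
```

With these made-up numbers the correlation falls from about 1.00 to about 0.84 and the slope flattens. How large the distortion is in practice depends on how many values are censored and how high the limits sit relative to the data.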

Or consider the case at the left, below, where there is no pattern: air concentrations with no trend. The laboratory is again required to use reporting limits, and those limits decrease over time, generally a good thing. Substituting one-half the limit produces artificial values (red squares, below right) that head down over time, and the trend eventually appears significant even though there is no actual trend in the air concentrations themselves. The trend was added by the scientist’s unfortunate data practices. How many reported trends have resulted from practices like this?

[Figure: Left: Concentration vs Time. Right: Censored vs Time, with substituted values shown as red squares.]
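A small sketch shows how falling reporting limits alone can manufacture a downward trend. The concentrations below are made-up values arranged symmetrically in time so that their true slope is zero, and the reporting limits of 4, 2, and 1 in successive periods are assumptions chosen for illustration.

```python
def ols_slope(x, y):
    """Least-squares slope of y against x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical air concentrations with no trend: the series is symmetric
# about its midpoint, so the fitted slope is zero.
time = list(range(12))
conc = [1.2, 0.6, 1.8, 0.9, 1.5, 0.7, 0.7, 1.5, 0.9, 1.8, 0.6, 1.2]

# Assumption: the reporting limit improves from 4 to 2 to 1 over time.
limits = [4] * 4 + [2] * 4 + [1] * 4

# Substitute one-half the limit for every value below its reporting limit.
substituted = [c if c >= rl else rl / 2 for c, rl in zip(conc, limits)]

slope_true = ols_slope(time, conc)
slope_sub = ols_slope(time, substituted)

print(f"true slope:        {slope_true:.3f}")
print(f"substituted slope: {slope_sub:.3f}")
```

The substituted series steps down from 2 to 1 to 0.5 simply because the limits did, so the fitted slope turns negative even though the underlying concentrations never changed.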

There are better ways. Methods exist for what statisticians call “censored data”, where the individual value is not known, but it is known to be above or below a numerical threshold. These methods use the two types of information available in the data: the known values of detected concentrations, and the proportion of data, both detected and not, below each reporting limit. By mining the information in the proportions, statistics such as the mean and UCL95, regression equations, and hypothesis tests can all be computed, all without substituting any fabricated values for nondetects. See the book Statistics for Censored Environmental Data using Minitab and R (Helsel, 2012) for more detail on data analysis with nondetects.
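As one illustration of using the proportions rather than substituted values, the sketch below computes a Kaplan-Meier estimate of the mean for left-censored data. It is a minimal sketch, not code from the book: the function name, the data, and the convention of placing any probability mass left over below the smallest detected value at the smallest recorded value are all assumptions made for this example.

```python
def km_mean_left_censored(values, censored):
    """Kaplan-Meier estimate of the mean for left-censored data.

    values[i] is the measured concentration when censored[i] is False,
    or the reporting limit (meaning "< limit") when censored[i] is True.
    """
    if not values:
        raise ValueError("no observations")
    detects = sorted({v for v, c in zip(values, censored) if not c},
                     reverse=True)
    prob_le = 1.0   # running estimate of P(X <= v), starting at the top
    mean = 0.0
    for v in detects:
        # Observations known to lie at or below v, detected or not.
        at_risk = sum(1 for u in values if u <= v)
        # Detected observations exactly equal to v.
        n_det = sum(1 for u, c in zip(values, censored) if not c and u == v)
        prob_lt = prob_le * (at_risk - n_det) / at_risk   # P(X < v)
        mean += v * (prob_le - prob_lt)                   # mass located at v
        prob_le = prob_lt
    # Assumed convention: leftover mass goes at the smallest recorded value.
    mean += min(values) * prob_le
    return mean

# Hypothetical data: <1, <1, 1.5, <3, 2.0, <3, 3.5, 4.0
values = [1, 1, 1.5, 3, 2.0, 3, 3.5, 4.0]
censored = [True, True, False, True, False, True, False, False]

km_mean = km_mean_left_censored(values, censored)
print(f"Kaplan-Meier mean: {km_mean:.5f}")
```

No value is invented for any nondetect here: each one contributes only the fact that it lies below its reporting limit. With no censored values at all, the estimate reduces to the ordinary arithmetic mean.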

