Top Twelve Tip #8
Meet the demands of regression: LNC


Linear regression makes three assumptions when fitting a straight-line model to data, summarized as LNC. First, the y versus x relationship should be linear (L). There is little reason to fit a straight-line model to data that are curved; predictions from the line would not fall near where the data are located. Second, the residuals should follow a normal distribution (N). This ensures that the p-values from this parametric procedure are correct. If residuals are skewed, p-values tend to be too high, and significant x variables could be tossed away by mistake. Note that the normality assumption is not about the y variable, nor the x variable, but about their joint pattern as shown by the residuals. Third, the variability around the line should be constant (C) for all values of x. Violating this assumption again leads to a loss of power, so that variables that actually are significant may not appear to be.

Environmental data commonly violate all three assumptions in their untransformed state.
Relationships appear curved, residuals are skewed, and variation around the line increases as the x variable increases. The three assumptions can be evaluated using plots:

[Figure: residual plots for Tip #8]

Upper left: a “residuals plot” to check if the pattern is linear.
Upper right: a probability plot to check the normality of residuals.
Lower left: a plot of the spread of the residuals (for example, the square root of the absolute standardized residuals) versus fitted values, to check whether variance is changing.
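
The quantities behind these three panels can be computed directly. The following is a minimal sketch in Python, assuming NumPy and SciPy are available; the function name `diagnostic_panels`, the synthetic data, and the use of standardized residuals for the spread panel are illustrative assumptions, not the software used to draw the figure above.

```python
import numpy as np
from scipy import stats

def diagnostic_panels(x, y):
    """Return the (horizontal, vertical) values for three residual diagnostic panels."""
    X = np.column_stack([np.ones(len(x)), x])        # design matrix: intercept + x
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary least squares fit
    fitted = X @ beta
    resid = y - fitted

    # Leverage of each point (diagonal of the hat matrix), used to standardize residuals.
    leverage = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)
    s = np.sqrt(np.sum(resid ** 2) / (len(x) - 2))   # residual standard deviation
    std_resid = resid / (s * np.sqrt(1.0 - leverage))

    # Theoretical normal quantiles paired with the sorted standardized residuals.
    (theo_q, ordered), _ = stats.probplot(std_resid, dist="norm")

    return {
        "residuals_vs_fitted": (fitted, resid),                          # linearity (L)
        "normal_probability": (theo_q, ordered),                         # normality (N)
        "scale_location": (fitted, np.sqrt(np.abs(std_resid))),          # constant variance (C)
    }

# Synthetic, deliberately curved data (an assumption for illustration only).
rng = np.random.default_rng(8)
x = rng.uniform(1, 10, size=100)
y = 0.3 * x ** 2 + rng.normal(scale=1.0, size=100)

panels = diagnostic_panels(x, y)
for name, (horiz, vert) in panels.items():
    print(name, len(horiz), len(vert))
```

Each pair of arrays can be passed to any plotting library as a scatterplot. A curved band in the first panel, points bending away from a straight line in the second, or a trend in the third would each point to a violated assumption.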


Alternatively, residuals can be checked for normality using the Shapiro-Wilk test, and for constant versus changing variance using the Breusch-Pagan or a similar test.
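
As a rough illustration of these two tests, the sketch below fits a line to synthetic data whose spread grows with x and then tests the residuals. It assumes NumPy and SciPy; `scipy.stats.shapiro` is SciPy's Shapiro-Wilk test, while the `breusch_pagan` helper is a hand-written version of Koenker's studentized form of the test (n times the R-squared from regressing squared residuals on x), not a library routine. The data and names are hypothetical.

```python
import numpy as np
from scipy import stats

def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1*x; returns coefficients, fitted values, residuals."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return beta, fitted, y - fitted

def breusch_pagan(x, resid):
    """Studentized (Koenker) Breusch-Pagan test for one x variable.

    Regress the squared residuals on x; LM = n * R^2 is compared
    to a chi-square distribution with 1 degree of freedom.
    """
    X = np.column_stack([np.ones(len(x)), x])
    u2 = resid ** 2
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    ss_res = np.sum((u2 - X @ gamma) ** 2)
    ss_tot = np.sum((u2 - u2.mean()) ** 2)
    lm = len(x) * (1.0 - ss_res / ss_tot)
    return lm, stats.chi2.sf(lm, df=1)

# Synthetic data (an assumption): the noise standard deviation increases with x.
rng = np.random.default_rng(8)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)

beta, fitted, resid = fit_line(x, y)
sw_stat, sw_p = stats.shapiro(resid)       # H0: residuals are normal
bp_stat, bp_p = breusch_pagan(x, resid)    # H0: variance is constant

print(f"Shapiro-Wilk:   W = {sw_stat:.3f}, p = {sw_p:.4g}")
print(f"Breusch-Pagan: LM = {bp_stat:.2f}, p = {bp_p:.4g}")
```

Small p-values suggest that the corresponding assumption is violated. The plots remain useful alongside the tests because they show how the assumption fails, not just whether it does.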

For decades, noncompliance with assumptions was dealt with by transforming the y variable, often using logarithms. Logs frequently produce a straight-line pattern, near-normal residuals and constant variance. But the predicted values, once converted back to original units, are then estimates of the geometric mean (median), not of the mean (see Tip #5). Today, bootstrapping the regression relationship provides an alternative to transforming the y variable, avoiding the normality and constant variance assumptions. However, the linear pattern remains important. If the data aren't linear, don't fit a linear model, even with bootstrapping. Transformations to linearity may still be necessary.
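
One common way to bootstrap a regression is to resample (x, y) pairs with replacement, refit the line each time, and read a confidence interval for the slope from the percentiles of the refitted slopes. The sketch below shows that idea, assuming NumPy only; the function name `bootstrap_slope`, the number of resamples, and the synthetic skewed data are illustrative assumptions, and other bootstrap variants exist.

```python
import numpy as np

def bootstrap_slope(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Pairs bootstrap with a percentile confidence interval for the slope of y on x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample rows with replacement
        slopes[i] = np.polyfit(x[idx], y[idx], 1)[0]     # slope of the refitted line
    lower, upper = np.quantile(slopes, [alpha / 2, 1 - alpha / 2])
    slope_hat = np.polyfit(x, y, 1)[0]                   # slope from the original data
    return slope_hat, lower, upper

# Synthetic data (an assumption): a linear relationship with right-skewed errors.
rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=100)
noise = rng.lognormal(mean=0.0, sigma=0.75, size=100) - np.exp(0.75 ** 2 / 2)
y = 1.0 + 0.8 * x + noise

slope_hat, lower, upper = bootstrap_slope(x, y)
print(f"slope = {slope_hat:.3f}, 95% percentile interval = ({lower:.3f}, {upper:.3f})")
```

An interval that excludes zero indicates a significant slope without relying on normal residuals or constant variance. The model being refitted is still a straight line, so the linearity check above applies here too.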