Top Twelve Tip #9

"All models are wrong; some models are useful" (quoted from G. E. P. Box), so choose the least hopelessly wrong model

(As with our other Top Twelve Tips, you'll get much more detail about these 12 in our Applied Environmental Statistics course, soon to be available on-demand online.)

Many methods have been proposed over the years for finding the ‘best’ regression model, yet many people still simply maximize r-squared. There are better indicators of a good regression model than that. Modern indicators work like cost-benefit analyses: they trade off the gain from reduced model error against the cost of adding another explanatory variable. r-squared has no cost term built in, so adding another x variable never decreases r-squared and almost always increases it, even if that variable is useless.
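To make the point about r-squared concrete, here is a minimal sketch using simulated data (not data from any real study). The helper names `solve` and `r_squared`, the sample size, and the coefficients are all assumptions chosen for illustration. It fits a least-squares regression with and without a pure-noise variable and compares the two r-squared values.

```python
import random

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(columns, y):
    """Fit y on an intercept plus the given columns by least squares; return r-squared."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - rss / tss

# Simulated data (an assumption for illustration): y depends on x1 only.
random.seed(1)
n = 30
x1 = [random.gauss(0, 1) for _ in range(n)]
junk = [random.gauss(0, 1) for _ in range(n)]  # unrelated to y by construction
y = [2.0 + 1.5 * a + random.gauss(0, 1) for a in x1]

r2_small = r_squared([x1], y)        # model with the real predictor only
r2_big = r_squared([x1, junk], y)    # same model plus the useless variable
print(r2_small, r2_big)
```

The second value should come out at least as large as the first, even though `junk` carries no information about y.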

Modern indicators of regression quality include Mallows’ Cp, AIC, AICc (AIC corrected for small samples) and BIC. Each combines a measure of model error with a penalty for the number of explanatory variables, so that for a given dataset (the same response variable and the same observations) the model with the lowest value is the ‘best’. These indicators can be compared between models with varying numbers and types of explanatory variables, so they can be used (for example) to compare a Y = x1 + x3 + x4 model against a Y = x2 + x3 + x5 model, as well as between "nested" models such as Y = x1 + x3 + x4 versus Y = x1 + x2 + x3 + x4.
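For readers who want to see how the penalty enters, the sketch below writes out commonly used least-squares forms of these indicators. It assumes normally distributed errors and drops additive constants that are identical for every model fit to the same data. The function names are hypothetical. Here `rss` is the residual sum of squares, `n` the number of observations, and `k` (or `p`) the number of estimated coefficients including the intercept. Software packages differ on whether the error variance is also counted in `k`, so the values may not match a given package exactly.

```python
import math

def aic(rss, n, k):
    """AIC for a least-squares fit, up to a constant shared by all models."""
    return n * math.log(rss / n) + 2 * k

def aicc(rss, n, k):
    """AIC with the small-sample correction term; requires n > k + 1."""
    return aic(rss, n, k) + 2 * k * (k + 1) / (n - k - 1)

def bic(rss, n, k):
    """BIC: same error term as AIC, but the penalty grows with log(n)."""
    return n * math.log(rss / n) + k * math.log(n)

def mallows_cp(rss, n, p, full_mse):
    """Mallows' Cp, using the mean squared error of the full model as the variance estimate."""
    return rss / full_mse - n + 2 * p

# Made-up numbers for illustration only: model B adds one variable
# to model A and barely reduces the residual sum of squares.
n_obs = 40
aic_a = aic(rss=120.0, n=n_obs, k=4)
aic_b = aic(rss=118.0, n=n_obs, k=5)
print(aic_a, aic_b)
```

With these made-up inputs, the small drop in error does not pay for the extra variable, so model A has the lower AIC. That is the cost-benefit trade-off that r-squared lacks.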

With modern indicators, statistical software goes beyond the previous capabilities of stepwise procedures and usually evaluates all the possible regression models with the variables at hand. Most commercial software, as well as the free R system, includes “all possible regressions” routines. For example, with 6 explanatory variables there are (2^6 - 1) = 63 possible regression models: 6 possible one-variable regressions, 15 possible two-variable regressions, and so on up to the 1 possible six-variable model. One or more modern indicators are computed for all 63 models, and the scientist can select from among the best (lowest error) models. A model near the ‘best’ but not at the minimum may be preferable based on other considerations, such as the cost of measuring its variables. The scientist lets the computer do what it does best, crunch numbers, and lets the human do what humans do best, evaluate and make decisions.
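The count of 63 can be checked by enumerating the candidate models directly. This is a minimal sketch, not the routine any particular package uses. The variable names x1 to x6 are placeholders. An all-possible-regressions routine would then fit each subset and compute one or more indicators for it.

```python
from itertools import combinations

variables = ["x1", "x2", "x3", "x4", "x5", "x6"]

# Every non-empty subset of the explanatory variables is one candidate model.
models = [subset
          for size in range(1, len(variables) + 1)
          for subset in combinations(variables, size)]

# Number of candidate models for each count of explanatory variables.
counts = {size: sum(1 for m in models if len(m) == size)
          for size in range(1, len(variables) + 1)}

print(len(models))
print(counts)
```

The total doubles with each added explanatory variable, which is why this search is left to the computer.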

“All models are wrong…” tells you to be humble. Your regression result is only as good as the sampling design and data you have collected. Even with your best work, you are only estimating relationships in the real world. You can do better than r-squared. You can’t be perfect.