Changes in Excel 2010 have improved its use for statistics considerably. For earlier versions of Excel, however, the answer is generally ‘No’. The following refers to versions prior to 2010 (2011 on the Mac). Some of the links below may no longer be active.

Excel is available to many people as part of Microsoft Office. People often use Excel as their everyday statistics software because they have already purchased it. Excel’s limitations, and its errors, make this a very questionable practice for scientific applications. For business applications where questions might be simpler and precision not as necessary, Excel may be just fine. Below are some of the concerns with using Excel for statistics that are recorded in journals, on the web, and from personal experience. References for more detail are listed at the bottom of this page.

“

- Yalta (2008), ref 1 below

“

- McCullough and Heiser (2008), ref 2 below

"

- Statistical Services Centre, Univ. of Reading, U.K. (at A, below)

"

-Neil Cox, ref 7 below

"

- The Gartner Group

1. Many statistical methods are not available in Excel.

Commonly-used statistics and methods NOT available within Excel include:

* Boxplots

* p-values for the correlation coefficient

* Spearman’s and Kendall’s rank correlation coefficients

* 2-way ANOVA with unequal sample sizes (unbalanced data)

* Multiple comparison tests (post-hoc tests following ANOVA)

* p-values for two-way ANOVA

* Levene’s test for equal variance

* Nonparametric tests, including rank-sum and Kruskal-Wallis

* Scatterplot arrays or brushing

* Principal components or other multivariate methods

* GLM (generalized linear models)

* Survival analysis methods

* Regression diagnostics, such as Mallow’s Cp and PRESS (it does compute adjusted r-squared)

* Durbin-Watson test for serial correlation

* LOESS smooths

Excel still requires the X variables to be in contiguous columns in order to input them to a multiple regression. This can be done with cut and paste, but is certainly annoying if many multiple regression models are to be built.

Excel's lack of functionality limits its use to only simple statistical tasks. For alternatives, see the 3 Solutions presented below.

2. Several Excel procedures are misleading/wrong.

Probability plots are a standard way of judging the adequacy of the normality assumption in regression. In statistics packages, residuals from the regression are easily, or in some cases automatically, plotted on a normal probability plot. Excel’s regression routine provides a Normal Probability Plot option. However, it produces a uniform distribution probability plot for Y instead of a normal probability plot for the residuals, even though it is labelled as the latter. The plot is therefore useless and misleading for judging the adequacy of regression residuals.

Excel’s CONFIDENCE function computes z intervals using 1.96 for a 95% interval. This is valid only if the population variance is known, which is never true for experimental data. Confidence intervals computed using this function on sample data will be too small. A t-interval should be used instead.

Excel is inconsistent in the type of p-values it returns. For most functions of probabilities, Excel acts like a lookup table in a textbook, and returns one-sided p-values. But in the TINV function, Excel returns a 2-sided p-value by default. Look carefully at the documentation of any Excel function you use, to be certain you are getting what you want. Excel's t-test is not the issue, it is when you use the underlying function that you may not get what you are expecting. For example, to compute a 95% t confidence interval around the mean, the standard method is to look up the t-statistic in a textbook by entering the table at a value of alpha/2, or 0.025. This t-statistic is multiplied by the standard error to produce the length of the t-interval on each side of the mean. Half of the error (alpha/2) falls on each side of the mean. In Excel the TINV function is entered using the value of alpha, not alpha/2, to return the same number. In Excel 2007 this is at least more obvious in the metadata provided onscreen.

The SOLVER function is used to solve nonlinear equations, including approximate maximum-likelihood solutions. SOLVER unfortunately frequently returns a solution prior to finding a global minimum or maximum, and so presents incorrect slopes and other coefficients as if they were correct (ref 2). In short, it can give back the wrong answers! In testing by McCullough and Heiser (ref 2), the wrong answer was returned 41% of the time.

3. Distributions are not computed with precision.

In reference 1, Yalta presents a table of numerical errors reported for Excel 97, carrying them through for later versions. Almost all are listed as “Not fixed” or “Poor fix” in Excel 2000, 2003 and now in 2007. In short, essentially all numerical problems found in Excel 97 are still there in Excel 2007.

Statistical distributions used by Excel do not agree with better algorithms for those distributions at the third digit and beyond. So p-values are approximately correct, but not as exact as would be desired by a statistician. This may not be harmful for hypothesis tests unless the third digit is of concern (a p-value of 0.056 versus 0.057). It is of most concern when constructing intervals (multiplying a std dev of 35 times 1.96 give 68.6; times 1.97 gives 69.0) As summarized in reference 4:

"

Also see

4. Regression routines are incorrect for multicollinear data.

This affects multiple regression. A good statistics package will report errors due to correlations among the X variables. The Variance Inflation Factor (VIF) is one measure of collinearity. Excel does not compute collinearity measures, does not warn the user when collinearity is present, and reports parameter estimates that may be nonsensical. See ref (8) for an example on data from an experiment. Are multicollinear data of concern in ‘practical’ problems? I think so -- I find many examples of collinearity in environmental data sets.

5. Ranks of tied data are nonstandard.

When ranking data, standard practice is to assign tied ranks to tied observations. The value of these ranks should equal the median of the ranks that the observations would have had, if they had not been tied. For example, three observations tied at a value of 14 would have had the ranks of 7, 8 and 9 had they not been tied. Each of the three values should be assigned the rank of 8, the median of 7, 8 and 9.

Excel assigns the lowest of the three ranks to all three observations, giving each a rank of 7. This would result in problems if Excel computed rank-based tests. Perhaps it is fortunate none are available.

6. Many of Excel's charts violate standards of good graphics.

Use of perspective and glitz (donut charts!) violate basic principles of graphics. Excel's charts are more suitable to USA Today than to scientific reports. This bothers some people more than others.

Excel "

-- Y-S Su, ref (3).

"

Solution #1: Excel add-ins

Some of these limitations can be overcome by using add-in routines that avoid using Excel’s computation procedures. Many of these were evaluated in our newsletters reviewing statistical software for under $600 (see

---Warning

Most of these add-ins alter your copy of Excel. Perform a complete backup prior to installing them, in case you don't find them helpful and want to go back to the pristine version. Uninstalling add-ins is not always simple.

Solution #2: Alternatives to Excel

Yalta (ref 1) states that p-values [inverse probability distributions] reported by the free

Solution #3: Purchase statistics software, or use R (industry standard and free)

Our

Articles:

(1) The accuracy of statistical distributions in Microsoft Excel 2007

A.T. Yalta, (2008).

(2) On the accuracy of statistical procedures in Microsoft Excel 2007

B.D. McCullough and D.A. Heiser, (2008).

(3) It’s easy to produce chartjunk using Microsoft Excel 2007 but hard to make good graphs

Y-S. Su, (2008).

(4) On the accuracy of statistical procedures in Microsoft Excel 2003

B.D. McCullough and B. Wilson, (2005),

(5) On the accuracy of statistical procedures in Microsoft Excel ‘97

B.D. McCullough and B. Wilson, (1999),

(6) Problems with using Microsoft Excel for statistics

J.D. Cryer, (2001), presented at the Joint Statistical Meetings, American Statistical Association, 2001, Atlanta Georgia

(7)

Neil Cox, (2000), AgResearch Ruakura

(8) Statistical analysis using Microsoft Excel

Jeffrey Simonoff, (2002)

(9) On the Accuracy of Statistical Distributions in Microsoft Excel 97

Leo Knuesel

(10) Statistical flaws in Excel

Hans Pottel

Websites with more detail:

(1)

Patrick Burns

(2)

David Heiser

(3)

ProcessTrends.com

(4)

Stephen Few

Guides to Excel on the web:

(A)

(B)

(C)

Note: All opinions other than those cited as coming from others are my own.

Online at: