
Below is an email message from Gary McClelland concerning the algorithms used in Microsoft Excel to compute certain statistics. This discussion has a long history, and is (was) taking place on the edstat listserv around Aug - Sept 1996. The lines that begin with one or more > are from Frank Fujita asking a clarifying question about using Excel. The lines that begin without a > are from Dr. McClelland in response to Frank Fujita's question.
You should note that we use the single pass algorithm in class when we use our calculation formula. Our theoretical "Think about it" formulas, and our statistics package use better algorithms (the package algorithms are better than our theoretical formulas). You may also find the reference Dr. McClelland provides quite valuable.
Date: Sun, 22 Sep 1996 20:41:06 -0400
From: Gary McClelland <Gary.McClelland@Colorado.edu>
To: Multiple recipients of list <edstat-l@jse.stat.ncsu.edu>
Subject: Re: statistics using spreadsheets (fwd)
X-Comment: Statistics Education Discussion
> Gary McClelland (gary.mcclelland@colorado.edu) wrote:
> > Anyone who uses the poor statistical algorithms in spreadsheets deserves
> > what they get.
>
> Is there any example where doing the definitional formula in the
> spreadsheet produces these errors? I tried this one where I sum the
> numbers, divide by N to get the average, deviate from the average and
> average the deviations (this being the definition of variance -- the
> average squared deviation from the mean). It worked well here, are there
> cases where it will not work?
>
> Frank Fujita
> PS I'm definitely *not* talking about the functions or add-ins
here.
This is exactly the point. There appears to be nothing wrong with
Excel's underlying numerical precision. The problem is with the
algorithms they use. Their functions and add-ins for computing
variance appear to use the old textbook one-pass formula for
computing the sum of squared deviations:
Sum X^2 - (1/N)(Sum X)^2
which was very useful when we had to calculations by hand or on
those clickity-clack Monroe mechanical calculators. But the
one-pass formula can be a disaster when implemented in a computer
algorithm. Your computations above use the much more numerically-
stable two-pass algorithm, and hence, they are much more precise
than the functions implemented in the Excel. It is called the
two-pass algorithm because you have to make one pass through all
the data to get the mean and then a second pass to calculate
and sum all the squared deviations. Even better is the
two-pass algorithm with correction which is
Sum(X-Xbar)^2 - (1/N)(Sum(X - Xbar))^2
With no rounding errors, the second term is 0. But if there are
any rounding errors, then the second term turns out to be a
good approximation of the rounding error that has been
added to the first term. Subtracting it then fixes the rounding
errors within a quite good approximation.
A good reference is
Chan, T.F., Golub, G.H., & LeVeque, R.J. (1983). Algorithms
for computing the sample variance: Analysis and recommendations.
American Statistician, 37(3), 242-247.
Incidentally, the Numerical Recipe books use the corrected two-pass
algorithm as recommended by Chan et al. Excel does not. Ironically,
while Excel uses a poor one-pass algorithm for variance, they
appear to use a better two-pass algorithm for skewness and kurtosis!
As I said before, they _could_ have done it right, but they
didn't. Several posters have also noted that hand calculators
often use the poor one-pass algorithm as well.
gary
-------------------------------------------------------------------
Gary.McClelland@Colorado.edu Dept of Psychology, CB345
http://psych.colorado.edu/~mcclella/ Univ of Colorado
voice: 303-492-8617 Boulder, CO 80309-0345
fax: 303-492-5580 USA
------------------------------------------------------------------