Variance Calculation

What Formula is best?

Below is an email message from Gary McClelland concerning the algorithms used in Microsoft Excel to compute certain statistics. This discussion has a long history, and is (was) taking place on the edstat listserv around Aug - Sept 1996. The lines that begin with one or more > are from Frank Fujita asking a clarifying question about using Excel. The lines that begin without a > are from Dr. McClelland in response to Frank Fujita's question.


You should note that we use the single pass algorithm in class when we use our calculation formula. Our theoretical "Think about it" formulas, and our statistics package use better algorithms (the package algorithms are better than our theoretical formulas). You may also find the reference Dr. McClelland provides quite valuable.



Date: Sun, 22 Sep 1996 20:41:06 -0400

From: Gary McClelland <Gary.McClelland@Colorado.edu>

To: Multiple recipients of list <edstat-l@jse.stat.ncsu.edu>

Subject: Re: statistics using spreadsheets (fwd)

X-Comment: Statistics Education Discussion

> Gary McClelland (gary.mcclelland@colorado.edu) wrote:

> > Anyone who uses the poor statistical algorithms in spreadsheets deserves

> > what they get.

>

> Is there any example where doing the definitional formula in the

> spreadsheet produces these errors? I tried this one where I sum the

> numbers, divide by N to get the average, deviate from the average and

> average the deviations (this being the definition of variance -- the

> average squared deviation from the mean). It worked well here, are there

> cases where it will not work?

>

> Frank Fujita

> PS I'm definitely *not* talking about the functions or add-ins here.


This is exactly the point. There appears to be nothing wrong with

Excel's underlying numerical precision. The problem is with the

algorithms they use. Their functions and add-ins for computing

variance appear to use the old textbook one-pass formula for

computing the sum of squared deviations:

Sum X^2 - (1/N)(Sum X)^2

which was very useful when we had to calculations by hand or on

those clickity-clack Monroe mechanical calculators. But the

one-pass formula can be a disaster when implemented in a computer

algorithm. Your computations above use the much more numerically-

stable two-pass algorithm, and hence, they are much more precise

than the functions implemented in the Excel. It is called the

two-pass algorithm because you have to make one pass through all

the data to get the mean and then a second pass to calculate

and sum all the squared deviations. Even better is the

two-pass algorithm with correction which is

Sum(X-Xbar)^2 - (1/N)(Sum(X - Xbar))^2

With no rounding errors, the second term is 0. But if there are

any rounding errors, then the second term turns out to be a

good approximation of the rounding error that has been

added to the first term. Subtracting it then fixes the rounding

errors within a quite good approximation.

A good reference is

Chan, T.F., Golub, G.H., & LeVeque, R.J. (1983). Algorithms

for computing the sample variance: Analysis and recommendations.

American Statistician, 37(3), 242-247.

Incidentally, the Numerical Recipe books use the corrected two-pass

algorithm as recommended by Chan et al. Excel does not. Ironically,

while Excel uses a poor one-pass algorithm for variance, they

appear to use a better two-pass algorithm for skewness and kurtosis!

As I said before, they _could_ have done it right, but they

didn't. Several posters have also noted that hand calculators

often use the poor one-pass algorithm as well.

gary

-------------------------------------------------------------------

Gary.McClelland@Colorado.edu Dept of Psychology, CB345

http://psych.colorado.edu/~mcclella/ Univ of Colorado

voice: 303-492-8617 Boulder, CO 80309-0345

fax: 303-492-5580 USA

------------------------------------------------------------------