production date 2/5/00

Measures of Variability

Table of Contents Objectives
Range A poor measure of scatter.
A Solution Beginning An attempt to better measure scatter.
The Solution -- Variance Variance solves a mathematical problem.
Standard Deviation The square root of variance.
Calculating Variance and Standard Deviations Learn to calculate these two statistics.
Coefficient of Variation Fixing the Standard Deviation.
Normal Curves Learn about the standard deviation's meaning in normal distributions.
Skew Revisited Formulas for Skew.
Additional Information Discover interesting Web Links
Computer Project 10 Using Statlets to calculate measures of variation.
Computer Project 11 Another measure of variation calculation.
Percentile Ranks Position measures that lead to the Interquartile Range.
Computer Project 12 Interpreting Q-plots and P-plots.
Questions/Test Take the End of Chapter Test
Report Send a Chapter Report to your Instructor


Another characteristic of data that is very important is how much individual scores vary or scatter. Therefore, statisticians have found several ways to describe scatter in data sets. These scatter or variability measures provide an index of the similarity of the scores in a distribution. Variability may be said to be a measure of the extent to which scores spread out or scatter around a measure of central tendency. Statisticians usually use the mean as the measure of central tendency when calculating most of these scatter indices.

Low variability indicates that the scores in a distribution are tightly bunched together. High variability indicates that the scores are more dispersed from one another. There are four common measures of variability: (1) range, (2) variance, (3) standard deviation and (4) the coefficient of variation. In this chapter you will learn to understand and calculate all four.

A subject's position within a distribution is often indicated by reporting their percentile rank. The values below which particular percentages of the scores fall are called percentiles. For example, we have defined the median as the score at the 50th percentile. Half the scores (50%) are above the median, and half the scores are below it. Two other important percentiles are given special names. The 25th percentile is known as the lower (or first) quartile, and the 75th percentile is known as the upper (or third) quartile. These two special percentiles are used to calculate a less common measure of variability, the interquartile range.


Range   

To calculate the range you simply take the largest score in a distribution and subtract the smallest score from it. For this tiny data set {10, 9, 8, 7, 6, 5} the exclusive range is 10 - 5 = 5.
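The calculation above can be sketched in a few lines of Python. Python is not part of this course's software, and the function name exclusive_range is only an illustrative choice.

```python
def exclusive_range(scores):
    """Largest score minus smallest score."""
    return max(scores) - min(scores)

data = [10, 9, 8, 7, 6, 5]          # the tiny data set from the text
print(exclusive_range(data))        # 10 - 5 = 5
```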

The range is said to be a total variability measure that indicates the complete range or spread in the distribution. Ranges are extremely unstable from sample to sample. As we have mentioned before, statisticians take samples from populations, calculate sample statistics and estimate population parameters with them. If the sample range changes substantially from sample to sample, it cannot be a good estimator of the population range. Sample ranges are only useful as rough indicators of variability. They serve no other statistical function. Once we have calculated and reported a range, the value is not used in any other calculations.

One of the reasons that sample ranges are not good estimators of population ranges is that the range only uses two scores from the entire distribution. The applet on the left has six data values represented by six dots at the bottom of the figure. By default, these values are all set at three. At the top of the figure is shown the highest score, the lowest score and the range. Initially both the high and low scores are three and the range is therefore zero. Click and drag one or more of the data points to change their values. You will note that the range only depends on two scores (high and low). Can you arrange the figure in different ways where the high and low scores and the range are the same, but the scatter of the data values is dramatically different?

Some Problems

We can now calculate ranges as our measures of variability. However, as noted above, ranges are not good measures of scatter because their values only depend on two scores from a distribution. A better measure of variability would take into account all of the individual scores in a distribution. The figure on the left dramatically illustrates this problem. Which distribution has the most scatter? Do the ranges for these two distributions differ?
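The problem can be illustrated with a small Python sketch. The two data sets below are hypothetical (they are not the ones in the figure); they were chosen so that the high score, the low score and the range are identical even though the scores are arranged very differently.

```python
# Hypothetical data sets, not taken from the figure.
set_a = [1, 5, 5, 5, 5, 9]   # most scores bunched in the middle
set_b = [1, 1, 1, 9, 9, 9]   # scores piled up at the two extremes

range_a = max(set_a) - min(set_a)
range_b = max(set_b) - min(set_b)

print("Set A:", min(set_a), max(set_a), range_a)   # Set A: 1 9 8
print("Set B:", min(set_b), max(set_b), range_b)   # Set B: 1 9 8
```

Because only the highest and lowest scores enter the calculation, the range cannot tell these two sets apart.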


A Solution Beginning   

Measures of scatter may not have been derived exactly as explained below. However, thinking about the derivations in this way has helped many students better understand these measures. First, you are led down a blind alley to a solution that appears to work. Indeed, one of the problems with this first solution is that it does what it proposes to do, measure scatter. However, it doesn't do anything else. It doesn't lead to other statistical analyses, as the second solution does. Investigating this blind alley before you get to the second, more important solution has proven valuable.

Because the mean is the usual "best" measure of central tendency, statisticians began to use it as an anchor or reference point to which all the other scores in a distribution could be compared. Statisticians thought that deviations of each score away from the mean might lead to a measure of scatter that, unlike the range, considered every score in its calculation.

A Blind Alley

First, statisticians thought that they could simply add all the deviations from the mean together to arrive at a measure of scatter. However, they soon found that the sum of these deviations always added to zero. Indeed we have stated that a property of the mean is that algebraic deviations from it always sum to zero.
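You can check the zero-sum property for yourself. The following is a minimal Python sketch using the tiny data set from the Range section; the variable names are only illustrative.

```python
scores = [10, 9, 8, 7, 6, 5]

mean = sum(scores) / len(scores)             # 7.5
deviations = [x - mean for x in scores]      # 2.5, 1.5, 0.5, -0.5, -1.5, -2.5

print(deviations)
print(sum(deviations))                       # 0.0
```

The positive and negative deviations cancel exactly, so the plain sum of deviations tells us nothing about scatter.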



The first attempt to get around this zero sum property was to use absolute values. However, as the figure on the right illustrates, this doesn't work. Sums of absolute difference values for distributions that have the same scatter are different if there are more values in one of the distributions.

You should know that |X| is read as the absolute value of X. We worked with this type of error in the last chapter, calling it the Sum of Absolute Differences. You will remember that the median minimized this type of error. The mean did not.

A Partial Solution

Simply summing the absolute deviations is not enough. The scatter in the two distributions shown in the figure above on the right is the same; Set 2 just has more elements in the distribution. Mathematicians decided to take the average of these deviations. This descriptive statistic (the formula is shown on the left) is called the mean deviation (MD). If we do the calculation for MD we properly get the same value for both distributions.
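A short Python sketch shows both the problem and the fix. The two sets below are hypothetical stand-ins for the sets in the figure: Set 2 simply repeats every score in Set 1, so the scatter is the same but there are twice as many scores. The function names are illustrative only.

```python
def sum_abs_deviations(scores):
    """Sum of |X - mean| over all scores."""
    mean = sum(scores) / len(scores)
    return sum(abs(x - mean) for x in scores)

def mean_deviation(scores):
    """Average absolute deviation from the mean (MD)."""
    return sum_abs_deviations(scores) / len(scores)

set_1 = [4, 6]            # hypothetical data
set_2 = [4, 4, 6, 6]      # same scatter, twice as many scores

print(sum_abs_deviations(set_1), sum_abs_deviations(set_2))   # 2.0 4.0
print(mean_deviation(set_1), mean_deviation(set_2))           # 1.0 1.0
```

The sums differ only because Set 2 has more scores; dividing by the number of scores removes that effect.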

Mean deviations are good measures of variability, but they are dead ends. There are no other statistics built on the idea of mean deviations. There are also difficulties with just eliminating the signs of numbers using the absolute value function.

If the next step had not been thought of, the study of statistics would be finished. All of the statistical analytic techniques would have been discussed. Can you think of another mathematical way to eliminate the negative values in the deviations from the mean? Someone proposed squaring the deviations instead of taking their absolute values before adding them up and taking their average. This statistician discovered what we now call variance.

The square of these deviations was discussed as the Sum of Squares Error in the previous chapter. You will remember that the mean minimized the sum of squares.


The Solution - Variance   

The calculation of variance solved one difficulty and is a major component in many other statistical procedures. Indeed, if you continue your study of statistics, the next course you take will probably be titled Analysis of Variance. While the course after that won't have this title, it could be called Sophisticated Analysis of Variance, and the course after that could be called Very Sophisticated Analysis of Variance.

While the titles above won't actually be seen in the course catalog, variance is one of the most important concepts and calculations conducted in statistics. Variance is a number that represents the average of the squared deviations around the mean. A large variance indicates that there is large scatter in the distribution. Small variance indicates that the scores are bunched tightly around the distribution's mean.

There are two different formulas for calculating variance. One of the formulas is used for populations and the other is used to calculate a sample statistic. Remember that sample statistics are used to estimate unknown population parameters. You will calculate sample variance in order to estimate an unknown population variance. Plugging the identical values into these formulas will give different answers because the formulas are slightly different. The formula for population variance, shown on the left, is straightforward. This formula is used for calculating variance if the distribution contains the entire population of interest.

Note that the population variance is abbreviated with a Greek letter (sigma, squared), the mean is written as the population parameter (mu), and the number of scores is represented with a capital N.

If we simply substituted sample statistics directly into this formula the population variance would be inaccurately estimated. To understand why, begin by thinking about how sample ranges estimate population ranges. Suppose you have a large population and calculate the population range (subtract the smallest score in the population from the largest score). Now suppose you take a sample of scores from this population and calculate the sample range (subtract the smallest score from the sample from the largest score in the sample). Unless the sample contained both the largest and smallest scores from the population the sample range would always be smaller than the population range. In most cases, the sample would not contain the largest and smallest scores in the population. In most cases the sample range would underestimate the population range. The sample range could never be larger than the population range. At the very best, the sample range could equal the population range. On average, the sample range would underestimate the population range. The sample range is called a biased statistic, because on average it does not equal the population parameter.

The same thing would happen to sample variance estimates if we did not correct them. On average, a sample will have a bit less scatter than the population. Statisticians found that if they subtracted one from the denominator of the variance equation, the result increased slightly, and the average of all the sample variances taken from a population accurately estimated the population variance. By subtracting one case from the denominator, the statistic becomes an unbiased estimator of the population parameter. To make sure we can tell the difference between the variance formulas for populations and samples, in the sample formula (shown on the left) statistics are substituted for parameters, and the letter "n", which indicates the size of the sample, is lower case.
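The effect of subtracting one from the denominator can be checked on a very small example. The Python sketch below is only an illustration under stated assumptions: the population {1, 2, 3} is hypothetical, the samples are all possible samples of size 2 drawn with replacement, and the function names are made up for this example.

```python
from itertools import product

def population_variance(scores):
    """Average squared deviation from the mean (divide by N)."""
    mu = sum(scores) / len(scores)
    return sum((x - mu) ** 2 for x in scores) / len(scores)

def sample_variance(scores):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(scores) / len(scores)
    return sum((x - x_bar) ** 2 for x in scores) / (len(scores) - 1)

population = [1, 2, 3]                       # hypothetical population
sigma_sq = population_variance(population)   # 2/3

# Every possible sample of size 2, drawn with replacement (9 samples).
samples = list(product(population, repeat=2))

# Average result when the uncorrected (divide by n) formula is used on samples.
avg_uncorrected = sum(population_variance(s) for s in samples) / len(samples)
# Average result when the corrected (divide by n - 1) formula is used.
avg_corrected = sum(sample_variance(s) for s in samples) / len(samples)

print(round(sigma_sq, 4))          # 0.6667
print(round(avg_uncorrected, 4))   # 0.3333
print(round(avg_corrected, 4))     # 0.6667
```

In this example the uncorrected formula underestimates the population variance on average, while the n - 1 version averages out to exactly the population value.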

The figure on the left shows two populations which have different variability. The first set has small scatter, while the second set has a much larger scatter. Note that both means are equal to 9. When you calculate the variance for the first distribution you get a relatively small number (.8571) while the variance in the second population is relatively large (53.1429).

The equations for variance shown above are what statisticians refer to as "Think about it" formulas. They make sense when you simply look at them. People understand what the equations are doing. They also make sense in that they are measuring scatter. However, they are horrid for use in hand calculators. (As an aside, they are rarely used in computer programs either.) The figure on the right presents equations that are quite useful in hand calculators, but are quite poor when used in computer programs. I don't think that on the surface they make much sense. However, using your calculator, you can arrive at an answer without ever stopping to write down an answer for any step. These formulas are known as the calculator formulas in many textbooks.
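The figure with the calculator formulas is not reproduced here, so the Python sketch below assumes the form usually given in textbooks: the sum of the squared scores, minus the squared sum of the scores divided by n, all divided by n - 1 for a sample. The function names are illustrative. The sketch compares that assumed form with the "Think about it" form on the tiny data set from the Range section.

```python
def sample_variance_definitional(scores):
    """'Think about it' form: squared deviations from the mean, over n - 1."""
    n = len(scores)
    x_bar = sum(scores) / n
    return sum((x - x_bar) ** 2 for x in scores) / (n - 1)

def sample_variance_calculator(scores):
    """Assumed calculator form: (sum of X^2 - (sum of X)^2 / n) / (n - 1)."""
    n = len(scores)
    sum_x = sum(scores)                      # running total of the scores
    sum_x_sq = sum(x * x for x in scores)    # running total of the squared scores
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)

data = [10, 9, 8, 7, 6, 5]
print(sample_variance_definitional(data))   # 3.5
print(sample_variance_calculator(data))     # 3.5
```

The calculator form needs only two running totals, which is why it suits a hand calculator. One commonly cited reason it is avoided in computer programs is that subtracting two large, nearly equal totals can lose precision in floating-point arithmetic.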

It is very important that you can both calculate and understand the concept of variance. The figure on the left allows you to set six variables to values between -2 and 146. As you click and drag the balls that represent the data values, the mean and variances for both samples and populations will be calculated and displayed. The actual value of the mean is calculated and displayed on the first line of the figure. The position of the mean is also shown using a red line drawn across the figure. This line is clearly labeled as the mean on the far right side of the figure. The variances are displayed in the second and third figure lines. The green bar at the top of the figure dynamically graphs variance changes. The longer this bar, the more variance. Finally, each of the data values is displayed at the bottom of the figure. You can use these values for hand calculations, using the displayed answers to check your work.

There is one more variance formula, shown on the left, which can be used if the data constitutes a population and all the values are dichotomous. Dichotomous data consists of only two values (0, 1). This type of data is often found in the social sciences. Zero might indicate failure on an examination item, while one indicates passing, or 0 might be male while 1 is female. If you have population data that is dichotomous, this equation can be used to calculate variance, where p = the proportion of passes or correct responses or the proportion of 1s, and q = the proportion of incorrect responses, or failures or 0s. Remember that q must equal 1-p.

The figure on the left demonstrates the calculation of variance using dichotomous data. Dichotomous data is also called binary data.
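For 0/1 data the population variance reduces to p times q. The Python sketch below checks this on a hypothetical set of eight examination items (the data are made up for illustration) by computing the variance both the short way and the long way.

```python
# Hypothetical item results: 1 = pass, 0 = fail.
items = [1, 1, 1, 0, 1, 1, 0, 1]

p = sum(items) / len(items)    # proportion of 1s: 0.75
q = 1 - p                      # proportion of 0s: 0.25
variance_pq = p * q

# Long way: population variance from squared deviations around the mean.
# For 0/1 data the mean equals p.
variance_long = sum((x - p) ** 2 for x in items) / len(items)

print(variance_pq)     # 0.1875
print(variance_long)   # 0.1875
```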


Standard Deviation   

The standard deviation is simply the square root of variance. Standard deviations return the variability measure back to the original score units instead of squared score units. The equations on the left provide you with the "Think about it" and calculator equations for standard deviations.

Using these equations is quite simple. If you can calculate variance then just press the square root button on your calculator, and you can calculate standard deviations. The figure on the left allows you to drag the data values to any position on the graph. The population and sample standard deviations are calculated along with the mean. The green bar at the top of the figure dynamically graphs changes in the standard deviation. The longer the bar, the larger the standard deviation. Again, use this figure to check your ability to hand calculate the standard deviation.
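As a minimal Python sketch of the "press the square root button" step, using the tiny data set from the Range section (the variable names are illustrative):

```python
import math

data = [10, 9, 8, 7, 6, 5]
n = len(data)
mean = sum(data) / n                                  # 7.5
sum_of_squares = sum((x - mean) ** 2 for x in data)   # 17.5

population_variance = sum_of_squares / n              # divide by N
sample_variance = sum_of_squares / (n - 1)            # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(round(population_sd, 4))   # 1.7078
print(round(sample_sd, 4))       # 1.8708
```

Both standard deviations are in the original score units, while the variances (about 2.9167 and 3.5) are in squared score units.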


Calculating Variance and Standard Deviations   

The environment provides several ways to quickly calculate variance and standard deviations. First, if you are working with simple calculations with 10 or fewer values (typical student exam questions), you can use the simple variance calculator, or the simple standard deviation calculator. For larger problems, you can use Statlet's single applet. Type your data into the spreadsheet, and make sure that variance and standard deviation are checked in the Stats Option dialog. If you need to copy and paste data into Statlets, click the Statlets' button on the main navigation panel, and use the menu version. If you are calculating the range, variance or standard deviation using the menu version of Statlets, the computer projects 8 and 9 demonstrated procedures that include the calculation of these statistics.


Coefficient of Variation   

The standard deviation has one small problem if it is used to compare the scatter of one distribution to another. It is quite common for the size of the standard deviation to be proportional to the size of the mean. That is, given the same amount of scatter, we would expect standard deviations to be larger if the mean of the data was 20,000 instead of 20. Although this is not always true, it is frequently true. To remove the effect of the overall size of the variable values, the coefficient of variation is calculated. The coefficient of variation is simply found by taking the standard deviation and dividing by the mean. The advantage of using the coefficient of variation to express scatter is that coefficients of variation are comparable across data sets with dramatically different means.
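The following Python sketch illustrates the idea with two hypothetical data sets whose means are 20 and 20,000. It assumes the sample standard deviation is used; the function name is illustrative.

```python
import math

def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    return sd / mean

small = [18, 20, 22]              # hypothetical data, mean 20, SD 2
large = [18000, 20000, 22000]     # hypothetical data, mean 20,000, SD 2,000

print(coefficient_of_variation(small))   # 0.1
print(coefficient_of_variation(large))   # 0.1
```

The standard deviations differ by a factor of 1,000, but relative to their means the two sets are equally scattered, and the coefficient of variation shows this.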


Normal Curves   

If you have a variable that is normally distributed (many, many variables are), then standard deviations are important because they allow the calculation of intervals within which certain known percentages of scores reside. Approximately 68% of the scores in a normal distribution are within 1 standard deviation of the mean (that is, between the mean minus 1 standard deviation and the mean plus 1 standard deviation). Approximately 95% of the scores in a normal distribution are within 2 standard deviations of the mean. The figure shown on the left illustrates this property of normal distributions.
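These percentages can be checked numerically. The Python sketch below uses the error function from the standard math module, which gives the proportion of a normal distribution lying within k standard deviations of its mean; the function name proportion_within is an illustrative choice.

```python
import math

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

print(round(proportion_within(1), 4))   # 0.6827
print(round(proportion_within(2), 4))   # 0.9545
```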


Skewness Revisited   

In Chapter 4 we discussed when one would choose to report a mean, median, or mode as the measure of central tendency. We stated that if the distribution was unimodal and seriously skewed, the median should be reported. At that point, we said that if the value for standardized skewness was outside ±2, the distribution could be considered seriously skewed. You have now come to a point where at least three formulas for measuring skew can be given. The first formula (not presented because of its simplicity) is taken from Richard P. Runyon and Audrey Haber's text Fundamentals of Behavioral Statistics (7th Ed.), published in 1991 by McGraw Hill. The authors state that when the mean is higher than the median, the distribution of scores is positively skewed. Conversely, when the sample mean minus the median is a negative value, the scores are negatively skewed. However, these indices of direction of skew tell us little about the amount of skew. Karl Pearson, whom many consider the founder of modern statistics, proposed the coefficient of skew (SK), shown in the figure on the left, where SK is the coefficient of skew, Mdn is the sample median, the sample mean is indicated by x-bar, and s is the standard deviation of the distribution.

The third formula, shown on the right, is frequently reported in statistics texts. The deviation of each value from the mean is taken to the third power. The sum of these cubed deviations is then scaled by the cube of the variable's standard deviation (and by the number of scores), so that the result does not depend on the units of measurement.

A positive sign indicates that the scores are positively skewed. If the distribution is symmetrical, the mean and median are the same. Therefore SK = 0.
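The figure with Pearson's formula is not reproduced here. The Python sketch below assumes the form commonly given in textbooks, SK = 3(x-bar - Mdn) / s, with s taken as the sample standard deviation; both the formula's exact form and the data are assumptions made for illustration.

```python
import statistics

def pearson_skew(scores):
    """Assumed form of Pearson's coefficient: 3 * (mean - median) / s."""
    x_bar = statistics.mean(scores)
    mdn = statistics.median(scores)
    s = statistics.stdev(scores)        # sample standard deviation
    return 3 * (x_bar - mdn) / s

skewed = [1, 2, 3, 4, 10]       # hypothetical data with one high score
symmetric = [1, 2, 3, 4, 5]     # hypothetical symmetrical data

print(round(pearson_skew(skewed), 4))   # 0.8485
print(pearson_skew(symmetric))          # 0.0
```

In the first set the mean (4) is pulled above the median (3) by the high score, so the coefficient is positive. In the second set the mean and median are both 3, so the coefficient is zero.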


Additional Information   

As noted in this chapter, Karl Pearson is thought by many to be the father of modern statistics. You can learn more about Karl Pearson by visiting a web site with information about many mathematicians.


Computer Project 10   

Calculating Variance and Standard Deviations

To do this tenth computer project, you need to first read the directions. Next, you need to close the page with the directions and look at the questions and possible answers in the project report.
To view the project report, you must be able to establish an active internet connection.
The project report will appear on a secondary page. After reading that secondary page, do not close it. Simply move the report page so that you can see this page (click and drag the window's title bar to expose this primary page). Click this page to activate it, and start Statlets by clicking the Statlets button on the Navigation panel. After completing the project, and if instructed to do so by your instructor, click the project report page to activate it, and answer the questions. After clicking the submit button, close both the report window and the Statlets windows.


Computer Project 11   

Calculating variation using Analyze/One Sample/One Variable Analysis

To do this eleventh computer project, you need to use the Obedience to Authority data. Calculate the variance, standard deviation, range, and coefficient of variation of the variable Volts using the Analyze/One Sample/One Variable Analysis. If you need assistance with this procedure, you can read the user manual pages for this procedure again.

If requested by your instructor submit the project report.
To view the project report, you must be able to establish an active internet connection.
The project report will appear on a secondary page. After clicking the submit button, close both the report window and the Statlets windows.


Percentile Ranks   

Measures leading to the Interquartile Range

Percentiles

By definition, the p-th percentile is a value below which p% of the data lie. Thus, the 25th percentile is the score below which 25% of the scores in its distribution lie. The Percentiles tab in the Analyze/One Sample/One Variable Analysis calculates percentile values for distributions.
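The definition can be turned around to find what percentage of the scores lies below a given value. The Python sketch below does this for a hypothetical set of 20 scores; the function name is illustrative. Statistical packages, possibly including Statlets, usually interpolate between scores when they report percentiles, so their results can differ slightly from a simple count like this one.

```python
def percent_below(scores, value):
    """Percentage of scores that are strictly less than value."""
    below = sum(1 for x in scores if x < value)
    return 100 * below / len(scores)

scores = list(range(1, 21))         # hypothetical scores 1, 2, ..., 20

print(percent_below(scores, 6))     # 25.0  (5 of the 20 scores are below 6)
print(percent_below(scores, 11))    # 50.0  (10 of the 20 scores are below 11)
```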

The figure below shows the Percentiles Tab output. Notice that by default, the sample values for the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles are calculated. By using the Options button, you can change these default values.


Three of those default values are special enough to have other names. The 25th percentile score is also called the lower (or first) quartile, the 50th percentile score is also called the median, and the 75th percentile score is also called the upper (or third) quartile.

Quantile Plots

Quantile or Q-plots are created by first ordering all the data from low to high values. The variable values are plotted along the horizontal axis, and their vertical position is determined by the following formula (i-0.5)/(n+0.25) for i = 1, 2, ..., n. This is close to having the vertical axis determined by the cumulative percentages of each value. The important point to remember is that if the data come from a normal distribution, the Q-plot will form the classic S-shaped form shown in the figure below:
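The plotting positions can be computed directly from the formula quoted above. The Python sketch below only applies that formula, (i-0.5)/(n+0.25), to a small hypothetical data set; it does not draw the plot, and the function name is illustrative.

```python
def q_plot_points(values):
    """Pair each ordered value with its vertical position (i - 0.5) / (n + 0.25)."""
    ordered = sorted(values)                 # order the data from low to high
    n = len(ordered)
    return [(x, (i - 0.5) / (n + 0.25)) for i, x in enumerate(ordered, start=1)]

data = [7, 3, 9, 5]                          # hypothetical data
for x, position in q_plot_points(data):
    print(x, round(position, 4))
# 3 0.1176
# 5 0.3529
# 7 0.5882
# 9 0.8235
```

The vertical positions rise steadily from just above 0 toward 1, which is why they behave much like cumulative proportions.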


Probability Plot

Probability or P-plots are very much like Q-plots except that the vertical axis is scaled in such a manner that if the data come from a normal distribution, the points will fall along a straight line as shown in the figure below.


Interquartile Range

The interquartile range is a less common measure of variability. The interquartile range equals the difference between the upper quartile and the lower quartile. The lower quartile indicates the point below which lies 25% of the data. The upper quartile indicates the point below which lies 75% of the data. Therefore, the interquartile range captures the middle 50% of the data values. Interquartile ranges are calculated using the Stats tab in the Analyze/One Sample/One Variable Analysis procedure. The interquartile range is also indicated by the distance across the box in a box-and-whisker plot.
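Outside of Statlets, the same calculation can be sketched with Python's standard statistics module. The data are hypothetical, and quartile conventions differ between packages, so the values Statlets reports for the same data may not match exactly.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]    # hypothetical scores

# quantiles(..., n=4) returns the three cut points: Q1, the median, and Q3.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                                 # upper quartile minus lower quartile

print(q1, median, q3)   # 3.0 6.0 9.0
print(iqr)              # 6.0
```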


Computer Project 12   

Use the Obedience to Authority data and plot both a Q-plot and P-plot. If requested by your instructor submit the project report.
To view the project report, you must be able to establish an active internet connection.
The project report will appear on a secondary page. After clicking the submit button, close both the report window and the Statlets windows.


Questions/Test   

This link allows you to take a computer scored end-of-chapter test. If your instructor requests to see the results of this examination, you can either copy and e-mail or print the feedback you will receive immediately after taking the test.

Report   

Please send a report indicating your understanding of this chapter to your instructor. You will need to know both your and your instructor's e-mail addresses.