All inferential statistical analysis can be captured in the following model:

DATA = MODEL + ERROR


ERROR DEFINITIONS

It is very important to define what we mean when we use the term error. To illustrate, suppose that your task is to guess the age of each person in class. Referring to the DATA = MODEL + ERROR statement above, the DATA we collect are the ages of the individuals enrolled in this class. You will need to develop a MODEL to "guess" each person's age. The correctness of your guesses will determine the ERROR. As the investigator (guesser), you will be rewarded when you are correct and punished when you are in error.

Modal Error

Let's suppose that you develop as your MODEL this statement: Age = 24. Now you apply this MODEL to each value in DATA. We reward you when you are correct (one dollar for each correct guess) and punish you when you are in error (a small slap). To determine how much ERROR is involved in using this MODEL, we can simply count the number of errors (CE). Try it with the first 10 people in class.

When the count of errors (CE) is used, the modal error is reported. There are only two values for the error (either we hit the statistician or we gave them a dollar). The typical error reported to readers is simply the more frequent of these two values.
Note that we didn't have to do any math to arrive at our MODEL. We just guessed 24 as the age. If we wanted to reduce our ERROR to a minimal value, we could have examined DATA to produce our MODEL.
What would we have "calculated" to minimize ERROR?
The Mode of the data is the correct answer.

Count of Errors

The figure to the left illustrates this point. There you see the six values in the frequency distribution shown on the right.
X     F
10    1
9     3
8     2
The value you use to represent the sample data is indicated by the red line at the bottom of the figure. Errors are shown by the color of each data value and are graphed by the green bar at the top of the figure. If the red line's value does not match a data value, that data value is colored blue. By default, the red line is placed so that it is incorrect with respect to every data value, so the error count is as high as possible (6 errors out of 6 possible errors). Click and drag the red line to change its position, and thus the value used to represent the data. As you move the red line, the value it represents is displayed in the figure's top line. If a data value is correctly represented, the color of its dot will change to red. On the second line at the top of the illustration, the number of blue values (errors) is counted and displayed.
Note that the mode gives the lowest count of errors. Because the mode is the score with the highest frequency, it will always minimize error when error is defined simply as the count of values incorrectly represented by a single value. Additional definitions of error will be explored later in this chapter.
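The dragging exercise can also be tried in code. The following is a minimal Python sketch, not part of the original figure: it expands the frequency distribution into the six data values and counts the errors for each candidate value. The function name count_errors is made up for this illustration.

```python
from collections import Counter

# Data expanded from the frequency distribution: X=10 (F=1), X=9 (F=3), X=8 (F=2).
data = [10, 9, 9, 9, 8, 8]

def count_errors(values, guess):
    """Count how many data values the single guess fails to match exactly."""
    return sum(1 for v in values if v != guess)

# Try every observed value as the MODEL and record its count of errors (CE).
errors_by_guess = {g: count_errors(data, g) for g in sorted(set(data))}

# The mode is the most frequent value in the data.
mode = Counter(data).most_common(1)[0][0]

print(errors_by_guess)  # {8: 4, 9: 3, 10: 5}
print(mode)             # 9
```

The smallest count of errors (3) occurs at 9, which is the mode. A value that does not appear in the data at all, such as the figure's default position, misses all six values.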


Median Absolute Deviation

Now let's suppose that your MODEL is still age = 24, but this time you want to define error a little differently. If the person's age is 25 or 23, so that you missed by a single year, you would like a little slap given for this error. On the other hand, if the person's age is actually 20 or 28, so that you missed by 4 years, you would like to be slapped 4 times harder. To determine error under this circumstance, you would find the absolute difference between each DATA value and the MODEL, and then sum these absolute differences. This is called the sum of absolute errors (SAE).
When the sum of absolute errors (SAE) is used, it is customary to report the median absolute error. This is easily done in MYSTAT or SYSTAT by sorting the absolute errors into ascending order and finding the middle one.
Again, note that we didn't have to do any math to arrive at our MODEL (age = 24), but if we wanted to minimize our error (SAE) we could have calculated the median.
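As a rough sketch of this calculation in Python (the five ages below are illustrative values borrowed from the examples in the text, not real class data), the absolute errors, their sum, and their median can be found like this:

```python
import statistics

# Illustrative ages only (assumed for this sketch, not real class data).
ages = [20, 23, 24, 25, 28]
model = 24  # MODEL: age = 24

# Absolute difference between each DATA value and the MODEL.
absolute_errors = [abs(age - model) for age in ages]

# Sum of absolute errors (SAE).
sae = sum(absolute_errors)

# Median absolute error: the middle value of the sorted absolute errors.
median_absolute_error = statistics.median(absolute_errors)

print(absolute_errors)        # [4, 1, 0, 1, 4]
print(sae)                    # 10
print(median_absolute_error)  # 1
```

In this small made-up sample the median age happens to be 24, so no other guess would give a smaller SAE.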

Sum of Absolute Differences

The figure on the left illustrates the data shown in the frequency distribution on the right.
x      f
140    2
100    2
90     1
70     1
50     2
30     1
10     1
The number of frequencies is 10. Thus, the median is the average of the value associated with the 5th frequency (70) and the next higher value (90). The median is 80 ((90 + 70)/2). The red line at the bottom of the figure shows the value selected to represent the data. Errors are shown as green lines from each data value to the red line. These errors are the distances of each data value from the single representative value indicated by the red line and displayed in the first line of the figure. The green bar at the top of the figure graphs, and the figure's second line displays, the sum of the distances shown by all the green error lines.
Each error is found by subtracting the selected value (red line) from the actual data value. Finding the error this way leads to negative distances when the red line value is larger than the data value and to positive distances when it is smaller. However, the error should be the same whenever a data value is the same distance from the representative value. Ten points above or ten points below the red line should be the same error, because the data value was missed by the same amount. Therefore, these distances (errors) must be converted to absolute values before adding them together. Error defined in this manner is called the sum of absolute differences. The median minimizes error defined in this manner.
Click and drag the red line to change the value of the score selected to represent the entire data set. Drag the red line so that its value is 80 (the median). Can you find a value that produces less error? Drag the red line so that its value is somewhere between the two values averaged to produce the median (70 to 90). Did the error value change? Why? Notice that any time the representative value is between the two averaged values, half the frequencies are above and half the frequencies are below the red line. If you move the line up one unit, one unit of error is added to each of the data values below the red line, but one unit of error is subtracted from each of the data values in the other half, above the red line. Thus, the total error between these two values remains constant.
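The flat region between 70 and 90 can be checked numerically. The following Python sketch is an illustration only: it expands the frequency distribution into its ten data values and computes the sum of absolute differences for several positions of the red line. The function name sum_abs_diff is made up for this example.

```python
import statistics

# Data expanded from the frequency distribution (10 values in total).
data = [140, 140, 100, 100, 90, 70, 50, 50, 30, 10]

def sum_abs_diff(values, guess):
    """Sum of absolute differences between each data value and the guess."""
    return sum(abs(v - guess) for v in values)

# Average of the 5th and 6th sorted values: (70 + 90) / 2.
median = statistics.median(data)

# Compare the error at several candidate positions of the red line.
for guess in (60, 70, 75, 80, 90, 100):
    print(guess, sum_abs_diff(data, guess))
```

Every guess from 70 through 90 gives the same total of 360, while 60 and 100 each give 380, which matches the argument that the error stays constant between the two middle values.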

Sum of Squared Errors (SSE)

Now let us suppose that you are still using a MODEL where age = 24, but this time you wish to punish errors this way. If a person's real age is 23 or 25, so you missed by one unit, you slap with one unit of strength. If the person's age is really 20 or 28, so you missed by 4 units, you wish to punish by the square of this difference (you want to slap with an intensity of 16 units). To do this, you would need to square each difference and then sum the squares over all the DATA values. This is called the sum of squared errors (SSE).

Sum of Squared Error

x     f
98    1
70    1
50    2
The mean minimizes the sum of squared error. The figure on the left represents the data in the frequency distribution on the right, and can be used to illustrate this property of the mean. As in the other interactive figures in this chapter, the red line at the bottom is used to change the value used to represent the data. The red line is drawn at a default value of 2. If you calculate the mode for this data set, the value is 50. The value of the median is 60. The value of the mean is 67. The different colored squares in the figure represent error defined as the squared distance from the value given by the red line to the data value. The green bar at the top of the figure displays the sum of these squares. Now drag the red line to other possible values. Be sure to stop at the mode, median, and mean. Can you set the error value lower than that given when the red line's value is 67 (the mean)?
Sum of squared error values are used when one wants to set a high penalty for being further away from the data value. Here is an example. Suppose the representative value missed two data values by 1 point and 4 points, respectively. Using the sum of squared errors, the error is increased dramatically for the second value. Instead of increasing by only the 4 points missed, the error increases by the square of that miss (16 points).
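The stops at the mode, median, and mean can be reproduced with a short Python sketch. This is an illustration under the assumption that the frequency distribution expands to the four values 98, 70, 50, and 50; the function name sum_squared_error is made up for this example.

```python
import statistics

# Data expanded from the frequency distribution: 98 (f=1), 70 (f=1), 50 (f=2).
data = [98, 70, 50, 50]

def sum_squared_error(values, guess):
    """Sum of squared differences between each data value and the guess."""
    return sum((v - guess) ** 2 for v in values)

mode = statistics.mode(data)      # 50
median = statistics.median(data)  # (50 + 70) / 2 = 60
mean = statistics.mean(data)      # 268 / 4 = 67

sse_at_mode = sum_squared_error(data, mode)
sse_at_median = sum_squared_error(data, median)
sse_at_mean = sum_squared_error(data, mean)

# Scan whole-number guesses from 0 to 150 for the smallest SSE.
best_guess = min(range(0, 151), key=lambda g: sum_squared_error(data, g))

print(sse_at_mode, sse_at_median, sse_at_mean, best_guess)
```

The sum of squared errors is 2704 at the mode, 1744 at the median, and 1548 at the mean, and the scan over whole-number guesses finds nothing lower than the value at 67.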

Brain Exercise

Using the following exercise:
  1. Click a button to fit a given MODEL to DATA
  2. Calculate ERROR
  3. Enter ERROR without using a RETURN key
The fitted MODEL is shown using a line, and the individual ERROR values are shown using colored squares (you are calculating squared errors). If you enter a correct ERROR value, the box background will turn blue - We Are Penn State. You may enter as many different values as you need. However, do not press the RETURN key.


Variance or Mean Squared Error

When the sum of the squared errors (SSE) is used as the aggregate index of error, the variance (or, as it is sometimes called, the mean squared error) is reported.

Note that again, our MODEL (age = 24) required no calculation. However, if we would have wanted to minimize our punishment, we could have calculated the mean and used it as our MODEL.
Note: the mean square error is the remaining ERROR per remaining potential parameter. I always wondered where the variance was in ANOVA: the mean squares are variances.

Standard Deviation

Of course, the square root of the MSE is the standard deviation. It is easier to interpret the square roots of these numbers.

Coefficient of Variation

Finally, an index sometimes used when SSE is used as the error is the coefficient of variation. It is common for the standard deviation to be proportional to the size of the mean. For example, you expect the standard deviation to be larger for IQs, which have means of 100, than for test scores, which have means of 10. To remove the effect of the overall magnitude of the data from the description of error, the coefficient of variation is reported:
Coefficient of variation = CV = S / mean, where S is the standard deviation.
SYSTAT reports this value directly; you have to calculate it by hand when using MYSTAT.

Brain Exercise

Use the automobile fatality rate by state, found in Exhibit 2.8 on page 25 of Judd and McClelland, together with Statlets to calculate the variance, standard deviation, and coefficient of variation of the RATE variable. Remember that Exhibit 2.8 is simply Judd 1-1 sorted. Fill out the following form and submit it.

Report Variance:
Report To:
Your e-mail address:
Standard deviation followed by coefficient of variation: