
Testing Other Special Hypotheses
The number of special hypotheses that can be tested within multiple regression is virtually endless.
These are the general steps in the strategy:

  1. Formulate the question as a null hypothesis about the ßj's or as a linear combination of the ßj's.
  2. Estimate the parameters and calculate SSE(C) for the compact model which incorporates the null hypothesis.
  3. Estimate the parameters and calculate SSE(A) for the augmented model which removes the restrictions imposed by the null hypothesis.
  4. Use SSE(C) and SSE(A) to calculate PRE and F.

Let's work a problem with our School Referrals data set.
Suppose we are school psychologists and we have been told that PIQs are as important as VIQs for predicting achievement. We can test this hypothesis. Below is the output using all 200 cases in the School Referrals data. The output was produced using SYSTAT. READ is our criterion measure.
MODEL A: READ = ßo + ß1VIQ + ß2PIQ + ei

STEP 1:
If we believe that VIQ and PIQ are equally important, then we believe that ß1 = ß2 or, stated differently, that ß1 - ß2 = 0.
That is our null hypothesis.
MODEL C: READ = ßo + ß1VIQ + ß1PIQ + ei
since ß1 = ß2.
Or rewritten:
MODEL C: READ = ßo + ß1(VIQ + PIQ) + ei

Now the trick, and it is a fairly general trick, is to construct a new variable, say VIQPIQ for the combined IQs, which equals VIQ + PIQ. Again, do this in the Data Editor.

STEP 3a:
Here is the ANOVA table from MODEL C:

                             ANALYSIS OF VARIANCE

   SOURCE   SUM-OF-SQUARES    DF  MEAN-SQUARE     F-RATIO       P

 REGRESSION       9995.606     1     9995.606      60.807       0.000
   RESIDUAL      32547.894   198      164.383

STEP 3b:
Here is the unrestricted MODEL A output.

                            ANALYSIS OF VARIANCE

   SOURCE   SUM-OF-SQUARES    DF  MEAN-SQUARE     F-RATIO       P

 REGRESSION      12512.502     2     6256.251      41.040       0.000
   RESIDUAL      30030.998   197      152.442

STEP 3c:
PRE = (32547.894 - 30030.998) / 32547.894 = 0.07733
PA = 3, PC = 2
F = 16.51 with 1 and 197 degrees of freedom. We obviously reject MODEL C.
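
The arithmetic in this step can be checked with a few lines of Python. This is only a sketch, not SYSTAT or Statlets output. The function name `pre_and_f` is made up for illustration, and n = 200 is the number of cases stated above.

```python
def pre_and_f(sse_c, sse_a, pc, pa, n):
    """Return (PRE, F) for comparing a compact model C with an augmented model A."""
    pre = (sse_c - sse_a) / sse_c                    # proportional reduction in error
    f = (pre / (pa - pc)) / ((1 - pre) / (n - pa))   # F with (pa - pc) and (n - pa) df
    return pre, f


# Residual sums of squares taken from the two ANOVA tables above
pre, f = pre_and_f(sse_c=32547.894, sse_a=30030.998, pc=2, pa=3, n=200)
print(f"PRE = {pre:.5f}, F = {f:.2f} on {3 - 2} and {200 - 3} df")
```

The same function should work for any compact/augmented pair, as long as SSE(C), SSE(A), PC, PA and n are read from the corresponding outputs.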

This same strategy can be used to solve countless problems using regression techniques.


I have produced a new data set called School Referrals 3 where VIQ and PIQ were added together to make a new variable VIQPIQ. Variables that were not needed for this analysis were also eliminated from the data set. Use Statlets to solve the same problem as above. With the academic version of Statlets only the first 100 cases can be entered, so your results will differ from those above.
Students are directed to closely examine pages 178-181 in Judd & McClelland.

CROSS VALIDATION
pg 178. You know someone else's regression equation. That becomes MODEL C with no estimated parameters (PC = 0). You can calculate SSE(C) in the Data Editor of some statistical packages, or using a spreadsheet program like Microsoft Excel. MODEL A is fit to your sample data, so you estimate all of its parameters. The usual PRE and F equations are then used. Of course, this test has the same problems as all multiple df tests.
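
A cross validation of this kind might be sketched as follows. Everything in the block is hypothetical: the eight data points and the "published" equation (intercept 2.0, slope 0.5) are invented purely to show the mechanics and are not from Judd & McClelland.

```python
# Hypothetical sample data, invented for illustration only
x = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 11.0, 13.0]
y = [3.1, 4.4, 4.2, 6.0, 6.9, 7.2, 8.8, 9.1]
n = len(y)

# MODEL C: someone else's equation used as is, so PC = 0 estimated parameters.
b0_pub, b1_pub = 2.0, 0.5          # hypothetical published coefficients
sse_c = sum((yi - (b0_pub + b1_pub * xi)) ** 2 for xi, yi in zip(x, y))

# MODEL A: least-squares fit to our own sample, so PA = 2 (intercept and slope).
mx, my = sum(x) / n, sum(y) / n
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx
sse_a = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

pc, pa = 0, 2
pre = (sse_c - sse_a) / sse_c
f = (pre / (pa - pc)) / ((1 - pre) / (n - pa))   # 2 and n - 2 df
print(f"SSE(C) = {sse_c:.3f}, SSE(A) = {sse_a:.3f}, PRE = {pre:.3f}, F = {f:.2f}")
```

Because PA - PC = 2 here, a large F says only that the published equation fits worse than your own estimates. It does not say whether the intercept, the slope, or both are responsible.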


COMPLEX EXAMPLE
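pg 179. The authors want to determine if the coefficient of one predictor is equal to the average of the coefficients of two others, that is, if ß1 = (ß2 + ß3)/2.

Their augmented model is:
GPA = ßo + ß1HSRANK + ß2SATV + ß3SATM + ei
The compact model was then
GPA = ßo + ((ß2 + ß3)/2)HSRANK + ß2SATV + ß3SATM + ei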
multiply every term by 2
2GPA = 2ßo + (ß2 + ß3)HSRANK + ß2(2)SATV + ß3(2)SATM + 2ei
divide every term by 2 (or multiply by .5)
GPA = ßo +(ß2 + ß3)(.5)HSRANK +ß2SATV + ß3SATM + ei
do the multiplications
GPA = ßo + ß2(.5)HSRANK + ß3(.5)HSRANK +ß2SATV + ß3SATM + ei
collect similar terms
GPA = ßo + ß2((.5)HSRANK +SATV) + ß3((.5)HSRANK + SATM) + ei
We then do the trick mentioned previously by constructing the two new terms, X'1 = (.5)HSRANK + SATV and X'2 = (.5)HSRANK + SATM.
Using these new terms, the compact model can be estimated.
GPA = ßo + ß2X'1 + ß3X'2 + ei

Notice the superscripts (primes) on the predictor variables indicating that they are derived variables.
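
The derived-variable trick can be sketched in Python. The data here are simulated, not the GPA data from the text, and the helper `ols` is a small hypothetical least-squares routine written for this example. The point is only to show that fitting the two derived predictors enforces the constraint, and that the resulting SSE(C) can be compared with SSE(A).

```python
import random


def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns.

    Returns (coefficients, sse); coefficients[0] is the intercept.
    """
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Normal equations (X'X)b = X'y, stored as an augmented p x (p + 1) matrix.
    aug = [[sum(row[r] * row[c] for row in rows) for c in range(p)]
           + [sum(row[r] * yi for row, yi in zip(rows, y))]
           for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(p):
            if r != k:
                factor = aug[r][k] / aug[k][k]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[k])]
    coefs = [aug[r][p] / aug[r][r] for r in range(p)]
    sse = sum((yi - sum(c * x for c, x in zip(coefs, row))) ** 2
              for row, yi in zip(rows, y))
    return coefs, sse


# Simulated (made-up) data in which the null hypothesis is false.
random.seed(1)
n = 40
hsrank = [random.uniform(40, 99) for _ in range(n)]
satv = [random.uniform(3.0, 8.0) for _ in range(n)]   # SAT in hundreds of points
satm = [random.uniform(3.0, 8.0) for _ in range(n)]
gpa = [0.5 + 0.02 * h + 0.10 * v + 0.15 * m + random.gauss(0, 0.3)
       for h, v, m in zip(hsrank, satv, satm)]

# Derived variables that build the null hypothesis into the compact model.
x1_prime = [0.5 * h + v for h, v in zip(hsrank, satv)]
x2_prime = [0.5 * h + m for h, m in zip(hsrank, satm)]

(c0, c2, c3), sse_c = ols([x1_prime, x2_prime], gpa)   # MODEL C, PC = 3
coefs_a, sse_a = ols([hsrank, satv, satm], gpa)        # MODEL A, PA = 4

pre = (sse_c - sse_a) / sse_c
f = pre / ((1 - pre) / (n - 4))                        # 1 and n - 4 df
print(f"implied b1 = {(c2 + c3) / 2:.4f}, PRE = {pre:.3f}, F = {f:.2f}")
```

In the compact fit, the coefficient on HSRANK is never estimated directly. It is implied by the other two estimates as (b2 + b3)/2, which is exactly the restriction being tested.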

Interpretation of Partial Regression Coefficients

To further your understanding of regression output, use the data in Exhibit 8.17 on page 189 in Judd & McClelland.

The first question to ask is whether Industrial Production (IP) can help predict Unemployment.
MODEL C: UN = ßo + ei
MODEL A: UN = ßo + ß1IP + ei
Here are the results from SYSTAT:
DEP VAR:      UN      N:      10   MULTIPLE R: 0.313  SQUARED MULTIPLE R: 0.098
ADJUSTED SQUARED MULTIPLE R: 0.000     STANDARD ERROR OF ESTIMATE:        0.972

  VARIABLE    COEFFICIENT    STD ERROR     STD COEF TOLERANCE    T    P(2 TAIL)

CONSTANT           -0.035        3.081        0.000      .      -0.011    0.991
      IP            0.021        0.022        0.313     1.000    0.931    0.379


                             ANALYSIS OF VARIANCE

   SOURCE   SUM-OF-SQUARES    DF  MEAN-SQUARE     F-RATIO       P

 REGRESSION          0.819     1        0.819       0.867       0.379
   RESIDUAL          7.557     8        0.945

--------------------------------------------------------------------------------------
Or from Statlets, using the menus Models/Regression/Simple Regression. Note that everything agrees with the SYSTAT output.
-------------------------------------------------------------------------
                           Analysis of Variance
-------------------------------------------------------------------------
Source         Sum of Squares   Df    Mean Square    F-Ratio      P-Value
-------------------------------------------------------------------------
Model          0.81931          1     0.81931        0.867375     0.3789
Residual       7.55669          8     0.944586
-------------------------------------------------------------------------
Total (Corr.)  8.376            9

With F = 0.867 and p = .379, the answer is that we fail to reject the null hypothesis: by itself, IP does not reliably predict UN.



Above, and in class, I have made a point of noting the Statlets menu choices for this bivariate regression output. If you use the menus Models/Regression/Multiple Regression and submit the same model, you will get this output.
Multiple Regression Analysis
---------------------------------------------------------------------------
Dependent variable: UN
---------------------------------------------------------------------------
                                       Standard          T
Parameter               Estimate         Error       Statistic      P-Value
---------------------------------------------------------------------------
CONSTANT              -0.0351724        3.08106          -0.01       0.9912
IP                     0.0206897      0.0222152           0.93       0.3789
---------------------------------------------------------------------------
 
                           Analysis of Variance
---------------------------------------------------------------------------
Source         Sum of Squares     Df    Mean Square    F-Ratio      P-Value
---------------------------------------------------------------------------
Model                 0.81931    1.0        0.81931       0.87       1.0000
Residual              7.55669    8.0       0.944586
---------------------------------------------------------------------------
Total (Corr.)           8.376    9.0
Note that the F ratio of .87 with 1 and 8 degrees of freedom is reported with a p-value of 1.00. This differs dramatically from the p-value of .3789 given for the same statistic above. For now, when doing bivariate regressions, use the Simple Regression routine. If you use the Multiple Regression routine to solve bivariate problems, square the t value and use its probability.
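
The link between t and F for a one-parameter test can be checked directly from the coefficient table. This is a small sketch using the estimate and standard error for IP printed in the Multiple Regression output above.

```python
# Values from the Statlets Multiple Regression table above (row IP)
estimate = 0.0206897
std_error = 0.0222152

t = estimate / std_error   # t statistic for the IP coefficient
f = t ** 2                 # for a one-parameter test, F equals t squared
print(f"t = {t:.3f}, t squared = {f:.3f}")
```

The squared t reproduces the F-ratio in the ANOVA tables, so the p-value attached to t (0.3789) is the one to report.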


This counterintuitive result (that industrial production is not related to unemployment) suggests that we look closer at the data.

Looking at the data and the following scattergram, we see that unemployment increases across the years. Could YR be used to predict Unemployment?


Here are the regression results:
DEP VAR:      UN      N:      10   MULTIPLE R: 0.654  SQUARED MULTIPLE R: 0.428
ADJUSTED SQUARED MULTIPLE R: 0.357     STANDARD ERROR OF ESTIMATE:        0.774

  VARIABLE    COEFFICIENT    STD ERROR     STD COEF TOLERANCE    T    P(2 TAIL)

CONSTANT            1.673        0.529        0.000      .       3.166    0.013
      YR            0.208        0.085        0.654     1.000    2.447    0.040


                             ANALYSIS OF VARIANCE

   SOURCE   SUM-OF-SQUARES    DF  MEAN-SQUARE     F-RATIO       P

 REGRESSION          3.586     1        3.586       5.989       0.040
   RESIDUAL          4.790     8        0.599

--------------------------------------------------------------------------------------
Notice that YR is a reliable predictor of UN, with PRE = .43.

To examine the data further, let's look at the ERROR, or the residuals, remaining from this prediction. It is easy to produce the residuals using Statlets.

Using the menus Models/Regression/Multiple Regression, simply click the Report tab, then the Options button, and finally select the Residuals option.
Note that the caution detailed above about using the Multiple Regression procedure to solve bivariate regression problems does not affect the calculation of residuals.
You should see the following output.
Regression results for UN
----------------------
                      
Row           Residual
----------------------
1              1.21818
2            -0.190303
3            -0.598788
4            -0.907273
5             0.484242
6            -0.224242
7            -0.532727
8            -0.441212
9               1.1503
10           0.0418182
----------------------
If you want (and we will want) you can add these values to the original data set.

Your residuals should look like the variable UNOYR in the data set. We have named this variable using the same symbolism the Judd and McClelland text uses on page 191. Below is the Statlets data page showing these data.


Notice how these values are identical to those in the text except for the number of places displayed.

What does UNOYR tell us? Looking at the particular values of UNOYR tells us when, relative to our MODEL, UN is unexpectedly large or small. For example, in the first year, unemployment was 3.1 million persons. The residual of 1.218 tells us that this value was 1.218 million higher than expected given the regression equation.

Another way to think about these residuals is that they are the part of the original data that cannot be predicted with YR. If any other variable is to be useful for making conditional predictions in a multiple regression, then that variable must be able to predict these residuals.
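
Two properties of these residuals can be checked with a short sketch, using the ten UNOYR values exactly as printed in the Statlets report above. They should sum to zero, apart from rounding, and their sum of squares should match the RESIDUAL sum of squares from the regression of UN on YR.

```python
# Residuals (UNOYR) copied from the Statlets report above
unoyr = [1.21818, -0.190303, -0.598788, -0.907273, 0.484242,
         -0.224242, -0.532727, -0.441212, 1.1503, 0.0418182]

total = sum(unoyr)                  # least-squares residuals sum to zero
sse = sum(r ** 2 for r in unoyr)    # error left over after predicting UN from YR
print(f"sum = {total:.6f}, SSE = {sse:.3f}")
```

The sum of squares is the 4.790 shown in the ANOVA table for YR. That is the error any additional predictor has to reduce.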

It makes sense to reexamine the original question about the relationship between unemployment and industrial production. Although IP was not a useful predictor by itself of UN, maybe IP is useful for predicting when UN is higher or lower than expected relative to the model of yearly changes.

It is important to make the distinction that asking whether a predictor variable is useful by itself is a very different question from asking whether it is useful after controlling for one or more other variables.

We are now asking whether IP is a useful predictor of UN after controlling for YR or after statistically removing the effects of YR.

Equivalently, we could say we are testing to see whether IP reduces error in predicting UN over and above the reduction achieved by using YR.

You could proceed directly to the solution, except that the coefficients would be hard to interpret because while YR has been removed from UN in the UNOYR residuals, YR is still confounded with IP (the two predictors are correlated).

To see if the nonredundant part of IP can predict the UNOYR scores, we need to remove YR from the IP scores. To do this simply regress IP on YR and save those residuals. Then copy those residuals to your original data file. In your text as well as in our data set they are named IPOYR.

The data set, scrolled so that the variable IPOYR is shown, is reproduced again directly below.

Note that there is an error on page 193 in your text: when YR is used to predict IP, the coefficient is 4.36 instead of 4.75.

Now we can ask the critical question of whether IPOYR reduces the error in predicting UNOYR relative to a simple model for UNOYR. Maybe when unemployment is unexpectedly high, industrial production is unexpectedly low? Note that this is a more sophisticated version of our original question, which simply asked if there was a relationship between unemployment and industrial production without any reference to expectations.

UNOYR tells us when unemployment is unexpectedly high or low given YR.

IPOYR tells us when industrial production is unexpectedly high or low given YR.

The effects of YR have been removed from both UN and IP. If we look at the relationship between UNOYR and IPOYR, we have a purer look at the relationship between UN and IP with the confounding effects of YR eliminated. Statisticians often describe this by saying that YR has been "controlled for."

Look at this scattergram produced by Statlets. What do you think the answer to the question is?


It certainly looks like a relationship exists. To answer this question statistically, we simply use the procedures we already know. However, there are two small complications.

The first complication is that we know a priori that the means of the residuals are zero because the sums of residuals must be zero in a least squares model. Therefore, the constant or intercept must also be zero. So we won't need to estimate it.

The second complication is that UNOYR is "used" data. We already estimated two parameters (the intercept and the coefficient for YR) to produce it. Therefore, instead of thinking of it as having an n of 10, we must think of it as having an n of 8.

MODEL C: UNOYR = 0 + ei
MODEL A: UNOYR = 0 + ß1IPOYR +ei
PC = 0 and PA = 1 and n = 8

Your text mentions that most regression programs have an option that allows estimation of the regression equation without estimating the intercept, but it will not matter here. The SSE Residual will be the same, the PRE will be the same, and the coefficient b1 will be the same regardless of whether you estimate the intercept or not. The F value reported by the program will be incorrect, however, because the degrees of freedom it uses are incorrect: the program treats n as 10 rather than 8. You will need to calculate your own.

Here are the results of the Statlets Simple Regression output using the Summary and ANOVA tabs.

Regression Analysis for UNOYR versus IPOYR
-----------------------------------------------------------------------
Model type: Linear
-----------------------------------------------------------------------
Equation: UNOYR = -1.03515E-17 - 0.103313*IPOYR
-----------------------------------------------------------------------
Coefficient     Estimate        Std. Error      t-value         P-value
-----------------------------------------------------------------------
Intercept       -1.03515E-17    0.118652        -8.7242E-17     1.0
Slope           -0.103313       0.0202566       -5.1002         9.0E-4
-----------------------------------------------------------------------
Correlation = -0.8745
R-squared = 76.48%
Std. error of est. = 0.375211


-------------------------------------------------------------------------
                           Analysis of Variance
-------------------------------------------------------------------------
Source         Sum of Squares   Df    Mean Square    F-Ratio      P-Value
-------------------------------------------------------------------------
Model          3.66207          1     3.66207        26.0121      9.0E-4
Residual       1.12627          8     0.140783
-------------------------------------------------------------------------
Total (Corr.)  4.78834          9

Note the agreement with your text on the PRE and the coefficient of -.103.

To calculate the same F (with 1 and 7 degrees of freedom) you will need to use the hand calculation formula. You have PRE for this model.
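
The hand calculation might look like this in Python. It is a sketch that takes the sums of squares from the Statlets ANOVA table above and uses n = 8, PA = 1 and PC = 0 as set out for these two models.

```python
# Sums of squares from the Statlets ANOVA table for UNOYR versus IPOYR
ss_model, ss_residual = 3.66207, 1.12627
pre = ss_model / (ss_model + ss_residual)

# UNOYR is "used" data: n = 8, with PA = 1 and PC = 0 as in the models above
n, pa, pc = 8, 1, 0
f = (pre / (pa - pc)) / ((1 - pre) / (n - pa))
print(f"PRE = {pre:.4f}, F = {f:.2f} on {pa - pc} and {n - pa} df")
```

The corrected F is smaller than the 26.0121 printed by Statlets, because the program divided the residual sum of squares by 8 rather than 7.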

The conclusion from this analysis as it might be stated in a journal article could be:
Over and above the yearly changes (that is, controlling for year), unemployment decreases, on average, by .103 million workers for each unit increase in the index of industrial production.

This conclusion contrasts sharply with the result of our first simple regression. The yearly changes were suppressing the relationship between IP and UN.

Variables which mask or suppress the simple relationship between other variables are known as suppressor variables.


Working Easy

What would happen if we just did a Multiple Regression with Statlets?
Here are the results:

Multiple Regression Analysis
---------------------------------------------------------------------------
Dependent variable: UN
---------------------------------------------------------------------------
                                       Standard          T
Parameter               Estimate         Error       Statistic      P-Value
---------------------------------------------------------------------------
CONSTANT                 13.4539        2.48385           5.42       0.0010
IP                     -0.103339      0.0216551          -4.77       0.0020
YR                      0.659417       0.104305           6.32       4.0E-4
---------------------------------------------------------------------------
 
                           Analysis of Variance
---------------------------------------------------------------------------
Source         Sum of Squares     Df    Mean Square    F-Ratio      P-Value
---------------------------------------------------------------------------
Model                 7.24976    2.0        3.62488      22.53       3.0E-4
Residual              1.12624    7.0       0.160891
---------------------------------------------------------------------------
Total (Corr.)           8.376    9.0
 
R-squared = 86.554 percent
R-squared (adjusted for d.f.) = 82.7123 percent
Standard error of est. = 0.401112
Coeff. of variation = 14.2238 percent
Mean absolute error = 0.271585
Durbin-Watson statistic = 1.32794


REGRESSION EQUATION IS:
UN = 13.4539 - 0.103339*IP + 0.659417*YR

Remember that the PRE values can be found from the F values, and to find the F values we can square the t values. The formula for PRE (for a test of a single parameter) is:
PRE = F / (F + n - PA)
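
For a test of a single parameter, solving the F equation for PRE gives PRE = F / (F + n - PA). The sketch below applies that to the t statistics printed in the Multiple Regression table above, with n = 10 and PA = 3. The function name `pre_from_t` is made up for illustration.

```python
def pre_from_t(t, n, pa):
    """PRE for a one-parameter test, from its t statistic (F = t squared)."""
    f = t ** 2
    return f / (f + n - pa)


# t statistics from the Statlets Multiple Regression table; n = 10, PA = 3
pre_ip = pre_from_t(-4.77, n=10, pa=3)
pre_yr = pre_from_t(6.32, n=10, pa=3)
print(f"PRE for IP = {pre_ip:.3f}, PRE for YR = {pre_yr:.3f}")
```

The PRE for IP agrees with the R-squared of 76.48% from the regression of UNOYR on IPOYR, up to rounding of the printed t value.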

The interpretation of a parameter estimate in multiple regression is the same as the interpretation we developed regressing residuals on one another.

Clearly and fortunately we do not need to do the laborious series of simple regressions because we get the same information from a single multiple regression analysis.
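
The equivalence can be demonstrated numerically. The sketch below uses made-up numbers, not the data from Exhibit 8.17, and the helper names `simple_fit` and `cross` are hypothetical. It regresses residuals on residuals, then computes the two-predictor coefficient directly from the usual closed-form expression, and the two slopes should agree.

```python
def simple_fit(x, y):
    """Least-squares intercept, slope, and residuals for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return intercept, slope, resid


def cross(u, v):
    """Sum of cross-products of u and v about their means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Made-up illustration data (NOT the values from Exhibit 8.17)
yr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ip = [102, 105, 111, 110, 118, 121, 129, 130, 134, 141]
un = [4.0, 4.1, 3.8, 4.6, 4.2, 4.9, 4.3, 5.0, 5.4, 5.1]

# Laborious route: remove YR from both variables, then regress the residuals.
_, _, un_o_yr = simple_fit(yr, un)
_, _, ip_o_yr = simple_fit(yr, ip)
_, slope_from_residuals, _ = simple_fit(ip_o_yr, un_o_yr)

# Direct route: coefficient for IP in the two-predictor multiple regression.
det = cross(ip, ip) * cross(yr, yr) - cross(ip, yr) ** 2
b_ip = (cross(yr, yr) * cross(ip, un) - cross(ip, yr) * cross(yr, un)) / det

print(f"residual-on-residual slope = {slope_from_residuals:.6f}")
print(f"multiple regression b(IP)  = {b_ip:.6f}")
```

The agreement does not depend on the particular numbers. It holds for any data in which the two predictors are not perfectly correlated.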


Not in the Text

Remember that the textbook complains about the partial correlations reported by SAS. Compare the PRE values we calculated with the PARTIAL CORR TYPE II values reported on page 197: our PRE values are the same. Also, the magnitude of the partial correlation is simply the square root of the corresponding PRE value. So the partial correlation between UN and IP has a magnitude equal to the square root of .765, which is .875. Interested students can calculate the correlation between UN and IP with YR partialed out (correlate the two residuals, UNOYR and IPOYR). You will find that the Pearson correlation is -.875; it is negative because the coefficient for IP is negative. The partial correlation between UN and YR equals the square root of .851, which is .922. This value is the correlation between the residuals of UN and YR with IP partialed from each.
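
A short sketch of that conversion, assuming only the PRE values quoted above and the signs of the coefficients from the Multiple Regression table. The helper name `partial_r` is made up for illustration.

```python
import math


def partial_r(pre, coefficient):
    """Partial correlation: square root of PRE, signed like the coefficient."""
    return math.copysign(math.sqrt(pre), coefficient)


r_ip = partial_r(0.765, -0.103339)   # UN and IP, controlling for YR
r_yr = partial_r(0.851, 0.659417)    # UN and YR, controlling for IP
print(f"partial r (IP) = {r_ip:.3f}, partial r (YR) = {r_yr:.3f}")
```

The square root alone loses the direction of the relationship, which is why the sign is taken from the regression coefficient.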


Odds & Ends

Multiple Regression is the most widely used and most widely abused statistical technique in the social sciences.