GLORY BE! At last the SPSS General program works! You can do your computer runs on either 6.1.3 or on SPSS 15+

THIS EXERCISE IS DUE BY CLASS THURSDAY APRIL 2.
READINGS

GUIDE 1: ISSUES IN MODELING
GUIDE 2: TERMINOLOGY
GUIDE 3: THE LOWLY 2 X 2 TABLE
GUIDE 4: BASICS ON FITTING MODELS
GUIDE 5: SOME REVIEW, EXTENSIONS, LOGITS
GUIDE 6: LOGLINEAR & LOGIT MODELS
GUIDE 7: LOG-ODDS AND MEASURES OF FIT
GUIDE 8: LOGITS, LAMBDAS & OTHER GENERAL THOUGHTS
OVERVIEW


 

EDF 6937-01       SPRING 2009
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 3: THE SPSS GENERAL LOGLINEAR PROGRAM
20 points
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
PROGRAM NUANCES
REVIEW THE TABLE
PROGRAM 1 THE SATURATED MODEL
GENERAL--YOUR "BEST MODEL"
ASSIGNMENT QUESTIONS

You will use the Loglinear Model Selection and General SPSS Programs to:

(A) run the saturated model for the four-way table of (1) reduc2 by (2) gender by (3) hihsmath (2 levels of high school math achievement--read on!) by (4) homecomp (own a home pc), then
(B) run and test the model you believe has the best fit.

You will write out the loglinear equations for your best-fitting model, then describe your model in words suitable (say) for a conference presentation.
 
 
 
IMPORTANT NOTE #1: Please include both your MODEL SELECTION AND your GENERAL output when you turn in Exercise 3.

 
 
IMPORTANT NOTE #2: LOOK AT EVERYTHING WHEN YOU DETERMINE A FINAL MODEL! The MODEL SELECTION output, the GENERAL output, both sets of Zs, k-way and partial association tables, chi-squares partitioned with the associated df, etc.!

Make sure that you TEST your final model! (You might want to test a couple of other models for comparison purposes too.)


 
ON MAKING LIFE EASIER IN SPSS
I RECOMMEND THE FOLLOWING TO MAKE YOUR VARIABLE CHOICES EASIER TO FIND. You really DO want to know what the numeric values are for the categories in each variable because this could influence any recoding decisions that you make. Changing the Data General defaults means that you often can find your variables for the statistical maneuvers much faster.
In the SPSS menu at the top of the screen, go to the “Options” choice under the “Edit” menu. 

Under “Data—General” choose the following:
Variable Lists: (1) Display NAMES (2) ALPHABETICAL

Under “Data—Output Labels” choose the following:
Outline Labeling: (1) Variables—NAMES AND LABELS (2) Variable values—VALUES AND LABELS.

Keep your model selection program output in front of you as you use the GENERAL program. It's easier to use the copious output about models and partial associations in the MODEL SELECTION package to identify your best model than to tediously try them out one at a time in the GENERAL program.

You will need the bigdigitalC8306.sav data file for Exercise 3.

PROGRAM NUANCES AND DIFFERENCES FROM THE MODEL SELECTION PROGRAM 

Your GENERAL computer program output will look different from the MODEL SELECTION output--although the conclusions about your final model should remain about the same. However, the values of the estimators will be different:

First, you are now including the estimate of the constant term in your equation with the GENERAL program.

Second, the MODEL SELECTION program uses "deviation estimates" or contrast coding (1, 0 and -1). The default for dichotomous variables in this case is 1, -1 (NOT 1,0). The GENERAL program uses "indicator estimates" or dummy coding (0, 1).

Third, check the variable codings very carefully to see which is the referent category and which is the omitted category. Make sure the variables are coded the way you wanted them to be.

Fourth, both programs add a small amount (typically 0.5) to each cell for the saturated model. This will change the expected frequencies at least slightly.

Fifth, the algorithms used for computation to calculate the coefficients and model are slightly different for each program (IPF* for MODEL SELECTION and Newton-Raphson for GENERAL).

*IPF = Iterative proportional fitting algorithm
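To see why the second point matters, here is a minimal Python sketch for a single dichotomous variable. It uses the home computer marginal implied by the table in this exercise (4119 owners, 4653 everyone else). The function names are illustrative, and treating category 2 as the reference category for indicator coding is an assumption, so check your own output to see which category was set to zero.

```python
import math

def deviation_coding(n1, n2):
    """Deviation (1, -1) coding: the two lambdas sum to zero."""
    mu = (math.log(n1) + math.log(n2)) / 2   # mean of the log counts
    lam1 = math.log(n1) - mu
    return mu, lam1, -lam1

def indicator_coding(n1, n2):
    """Indicator (1, 0) coding, with category 2 as the reference (assumed)."""
    constant = math.log(n2)                  # log count of the reference category
    beta1 = math.log(n1) - math.log(n2)
    return constant, beta1, 0.0

mu, lam1, lam2 = deviation_coding(4119, 4653)
constant, beta1, beta2 = indicator_coding(4119, 4653)

print("deviation:", round(mu, 4), round(lam1, 4), round(lam2, 4))
print("indicator:", round(constant, 4), round(beta1, 4), beta2)
```

Both codings reproduce exactly the same counts. Only the bookkeeping differs: with two categories the indicator coefficient is twice the deviation coefficient, and the constant shifts to absorb the difference.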


Because of the constraints that lambda and beta coefficients sum to zero within a particular variable, or across variable combinations (e.g., Reduc2-Homecomp), the parameter estimates are not independent of each other. This is especially true when you use the multinomial distribution, which assumes that n is fixed at the sample casebase (although with a saturated model, don't expect differences between multinomial and Poisson). MODEL SELECTION simply omits the redundant coefficients entirely, trusting that you know how to calculate them (particularly easy with contrast coding and dichotomous variables--these tend to be mirror images). GENERAL lists every parameter, but places zeros on the parameters that are not estimated independently of the others.

However, you can still use the Z-scores to guide you about which interaction terms or pairwise correlations you can drop.

When you create your GENERAL program, be sure to add the lower order terms to your program model that are hierarchically nested within the higher order interactions. The GENERAL program uses a hierarchical algorithm to calculate the G2 statistic and degrees of freedom. However, the parameter estimates change dramatically (and perhaps nonsensically) depending on which terms you include or omit while model building. The parameter estimates do not behave in a hierarchical fashion in the General program.
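If you want to double check which lower order terms are nested inside your higher order terms, a short Python sketch like the one below can list them for you. The function name is hypothetical, and the model shown is only an illustration built from terms used as examples in this handout. It is not a suggestion about the best-fitting model.

```python
from itertools import combinations

def hierarchical_terms(generating_class):
    """Expand the highest order terms into every term nested within them."""
    terms = set()
    for term in generating_class:
        # Sort the factor names so a*b and b*a count as the same term
        factors = sorted(term.split("*"))
        for size in range(1, len(factors) + 1):
            for combo in combinations(factors, size):
                terms.add(combo)
    # List the terms by order (main effects first), then alphabetically
    ordered = sorted(terms, key=lambda combo: (len(combo), combo))
    return ["*".join(combo) for combo in ordered]

model = hierarchical_terms(["gender*reduc2*homecomp", "hihsmath*homecomp"])
for term in model:
    print(term)
```

Every term the sketch prints is one you would paste into the Terms in Model: box in GENERAL, unless your design gives you definite grounds to leave it out.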

It is, of course, a different case if you have definite grounds to omit certain lower order terms. Perhaps you have an experimental design or a disproportionate sampling design that automatically forces some lower order terms or pairwise correlations to be zero. Since these effects really ARE zero, however, omitting them shouldn't change your estimates of the other parameters.

However, notice that nuances such as sampling design differ from partial effects or correlations that become zero when you control for other variables! Remember that it is often necessary to include "main effects" when you have higher order interactions, whether you are doing an Analysis of Variance or a Loglinear Analysis. Failing to do so often credits the interaction effects with more influence than they actually have.

If a partial effect (e.g., the gender marginal) becomes zero when other variables are controlled, you should probably still include this term when you create your loglinear equation.



Because the GENERAL program does not automatically select good (or otherwise) models, you can't set the entry criterion probability level.

But do remember that you can change the probability level used to include or omit a lambda or beta parameter, just as you can change entry criteria with the Logistic Regression package. You can also set the confidence interval to 0.80 instead of 0.95. This will generate a narrower confidence interval than the default 0.95 confidence interval. If the narrower 0.80 interval does not contain zero, you should probably retain that parameter in the model.
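Here is a small Python sketch of that confidence interval check. It assumes the interval is the usual large-sample one (estimate plus or minus z times the standard error). The estimate and standard error are made-up numbers for illustration, not results from the course data.

```python
from statistics import NormalDist

def wald_interval(estimate, std_error, level=0.95):
    """Large-sample interval: estimate +/- z * standard error (assumed form)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

def excludes_zero(interval):
    """True when zero falls outside the interval."""
    low, high = interval
    return low > 0 or high < 0

# Hypothetical parameter estimate and standard error
estimate, std_error = 0.12, 0.08

ci95 = wald_interval(estimate, std_error, 0.95)
ci80 = wald_interval(estimate, std_error, 0.80)

print("95% interval:", ci95, "excludes zero:", excludes_zero(ci95))
print("80% interval:", ci80, "excludes zero:", excludes_zero(ci80))
```

With these made-up numbers the 0.95 interval straddles zero but the 0.80 interval does not, which is exactly the kind of borderline parameter the looser criterion is meant to keep in view.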
 

REVIEW THE TABLE

This exercise uses results from five of the National Science Foundation surveys (1990, 1995, 1997, 1999 and 2006), with a total of 8772 weighted cases. What happened to the other cases? Well, NSF didn't ask about high school math achievement until 1990 but it didn't include the computer questions in 1992. The 2002 data are from the General Social Survey (telephone subsample) before NSF routed the NSF Surveys of Public Understanding of Science and Technology through there; the GSS only collects detailed data on educational achievement when it collects the NSF Surveys module. By 2006, the NSF Surveys were being collected through the General Social Survey.

The four variable table below is a new one. It substitutes the highest level of high school math for Hispanic background (in two categories, 2 years of high school algebra or more versus 1 year of algebra or less; this will necessitate a recode into a new variable, see below). You'll want to study this table and maybe do some crosstabulation runs and/or calculate some percentages to see how pairwise correlations and/or interaction effects may operate because you will need to summarize your results in words again at the end of exercise 3.

AT LEAST A BA DEGREE

MATH LEVEL           2 YEARS H.S. ALGEBRA OR MORE     1 YEAR H.S. ALGEBRA OR LESS
GENDER               MALE      FEMALE    ROW n        MALE      FEMALE    ROW n
HAS HOME COMPUTER    73.9%     69.1%     1256         60.7%     59.2%     112
EVERYONE ELSE        26.1      30.9      497          39.3      40.8      75
TOTAL                100.0%    100.0%                 100.0%    100.0%
n                    942       811       1753         84        103       187

JUNIOR COLLEGE OR LESS

MATH LEVEL           2 YEARS H.S. ALGEBRA OR MORE     1 YEAR H.S. ALGEBRA OR LESS
GENDER               MALE      FEMALE    ROW n        MALE      FEMALE    ROW n
HAS HOME COMPUTER    49.4%     46.0%     1829         31.0%     30.7%     922
EVERYONE ELSE        50.6      54.0      2014         69.0      69.3      2067
TOTAL                100.0%    100.0%                 100.0%    100.0%
n                    1827      2016      3843         1273      1716      2989

Source: NSF Surveys of Public Understanding of Science and Technology, 1990, 1995, 1997, 1999 and 2006, Directors: Jon D. Miller and Linda Kimmel; Opinion Research Corporation/MACRO, General Social Survey; available n = 8772 (weighted data).
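As one way to study the table before you run anything, the Python sketch below converts the percentages with a home computer into odds, then forms the male-to-female odds ratio within each combination of degree level and math level. The percentages come straight from the table; the dictionary layout and names are just illustrative choices.

```python
def odds(percent_yes):
    """Odds of 'yes' versus 'no' from a percentage."""
    return percent_yes / (100.0 - percent_yes)

# Percent with a home computer: (degree level, math level) -> (male %, female %)
pct_home_pc = {
    ("BA or more", "2+ years algebra"): (73.9, 69.1),
    ("BA or more", "1 year or less"): (60.7, 59.2),
    ("Junior college or less", "2+ years algebra"): (49.4, 46.0),
    ("Junior college or less", "1 year or less"): (31.0, 30.7),
}

# Male-to-female odds ratio of owning a home computer within each subtable
gender_odds_ratios = {
    key: odds(male) / odds(female)
    for key, (male, female) in pct_home_pc.items()
}

for key, ratio in gender_odds_ratios.items():
    print(key, round(ratio, 2))
```

If the odds ratios look about the same in every subtable, that is a hint (not a test!) that gender and home computer ownership may not interact with the other two variables. The formal evidence still has to come from your MODEL SELECTION and GENERAL output.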
 


THE SATURATED MODEL: YOUR SPSS MODEL SELECTION AND GENERAL PROGRAMS RUNS

Load your copy of the BIGDIGITALC8306.sav  file into the SPSS Data Editor.

The first thing is to collapse the highest level of high school math into two categories: "2 years of algebra or more" (category 1) and "1 year of algebra or less" (2).

Find the himath variable and click on the top of the himath column so that the column is highlighted.
On the top menu, click on the add variable icon.
SPSS will add a column headed with "var00001" to the left of the himath variable.
Relabel var00001 as "hihsmath". (You can use the Define Variable Properties menu under the SPSS top Data menu.)

Under Transform in the top SPSS menu, click on "Recode into Different Variables"
The Input Variable is "himath".
The Output Variable (under Name) is "hihsmath". (Click on "Change")
Then click on old and new values.
Click on "Range".
The range will go from "4" through "9".
Make that new value "1" and click "Add" in the large box.
The second range will go from "0" through "3".
It will have the new value "2" (click "Add").
Then click "Continue".
Then click OK.
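The logic of this recode can be sketched in a few lines of Python, which may help you check by hand what you expect the frequencies to show. The function name and the sample codes are hypothetical. The sketch assumes that any value outside the two ranges ends up missing in the new variable, which is what you should verify in your frequencies run.

```python
from collections import Counter

def recode_himath(value):
    """Collapse himath into hihsmath: 4-9 -> 1, 0-3 -> 2, anything else -> missing."""
    if value is None:
        return None          # missing stays missing
    if 4 <= value <= 9:
        return 1             # 2 years of algebra or more
    if 0 <= value <= 3:
        return 2             # 1 year of algebra or less
    return None              # values outside both ranges (assumed to become missing)

# Hypothetical himath codes, including a missing value and an out-of-range code
sample = [0, 3, 4, 9, None, 12]
recoded = [recode_himath(value) for value in sample]

print(recoded)
print(Counter(recoded))
```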

Click on the "hihsmath" data column to highlight the column.
Under Data in the top SPSS menu click "Define Variable Properties".
Add value labels (under Labels) if you like.
 
 
 
Run frequencies on the new "hihsmath" variable  and the old "himath" variable to ensure that missing data for "hihsmath" (coded "." for "SYSMIS"; about 9000 cases) are coded correctly and that your recode worked the way you wanted it to.

In the MODEL SELECTION program, run the following (add in order) under the saturated model:

Reduc2, Hihsmath, Gender and Homecomp

Each variable has the values 1 and 2.
Request the parameter estimates and the partial association table.
Change the probability level on the opening Model Selection menu to 0.20 if you like.
Print the output and examine the models.

Still in MODEL SELECTION, try out and test your best model(s).
Under "Model" change the option from Saturated to Custom.
Since MODEL SELECTION is a hierarchical program, you only need to include the higher order terms for this program.
Test your best model and note the G2, df and significance level.
If you like, try out a few other models too. (Maybe your first one wasn't "the best" after all...)



THE GENERAL PROGRAM
 

Under the SPSS Analyze program, go to the Loglinear section and click on General...

In this order, enter the variables:

reduc2     hihsmath     gender   homecomp

into the Factor(s): box.

Leave the distribution of cell counts on Poisson (for all dichotomous variables and a fully saturated model, your  conclusions will essentially be the same as if you had used the Multinomial distribution).

Click on the Model... box.

Leave the radio button choice on Saturated for this first run experience, then click the Continue box.

Click on the Options... box.
Click to put a check mark on the Estimates box.
GENERAL allows you to obtain the estimated parameters for any model, saturated or not.

If you like: Change the Confidence Interval: box to 80 (%)
Click on the Continue box.
Then click OK.


 
OUTPUT: THE "GENERAL" PROGRAM SATURATED RUN--GET THE PARAMETER ESTIMATES FOR YOUR BEST MODEL

You will have A LOT of output.
Depending on what you are interested in, you may wish to delete some of it (not for this assignment, however) when you work with loglinear models in data analysis.

First, you receive a lot of information about your program run. Always check the model in the GENERAL program (the saturated model should be OK but it's a good habit to cultivate with any of these programs) to make sure that you included the terms in the model that you intended to include. You will find the model at the very beginning of the program output after the "Factors" list.

Make sure the casebase (including the weighted casebase, which is used in the analysis), the variables, and the way the variables are coded correspond to what you believe you had entered into the analysis.

For example, make sure the number of categories (levels) for each variable corresponds to your knowledge of that variable. Too many categories could mean that you inadvertently included cases that had missing data codes or even "wild punches" as substantively real values. If so, you would want to use SPSS provisions to make sure these category values are correctly coded as missing the next time you analyze the data.  (Did you run the frequencies on each of your four variables first? Especially "himath" and the recoded "hihsmath".)

For your saturated model, the Chi-square should be zero with zero degrees of freedom. The "statistical significance" level is considered undefined in this case so you will simply see a dot under significance.
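The zero degrees of freedom are easy to verify by counting. In this minimal Python sketch (the function names are my own), each term uses up the product of (number of categories minus 1) across its variables, and the degrees of freedom are the number of cells minus the constant minus the parameters for every term in the model.

```python
from itertools import combinations
from math import prod

LEVELS = {"reduc2": 2, "hihsmath": 2, "gender": 2, "homecomp": 2}

def term_parameters(term, levels):
    """Independent parameters for one term: product of (categories - 1)."""
    return prod(levels[name] - 1 for name in term.split("*"))

def model_df(terms, levels):
    """Cells minus the constant minus the parameters for each term."""
    cells = prod(levels.values())
    return cells - 1 - sum(term_parameters(term, levels) for term in terms)

names = list(LEVELS)
saturated = ["*".join(combo)
             for order in range(1, len(names) + 1)
             for combo in combinations(names, order)]
main_effects_only = names

print(model_df(saturated, LEVELS))          # 0
print(model_df(main_effects_only, LEVELS))  # 11
```

You can use the same count as a sanity check on the degrees of freedom that GENERAL reports for any custom model you build later.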

Next, you receive the observed and model cell counts for each of the 16 cells in the table.

In a saturated model, the observed and expected cell counts will be identical. However, once you drop terms, these will diverge. In models that are not saturated, you will want to examine the cell count residuals, and especially the adjusted cell count residuals, for clues about where the model may not fit. Larger adjusted residuals (over 1.50) should alert you to possible terms that need to be placed back in the model so that the model accurately reflects your observed data table.
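To make the idea of residuals concrete, here is a Python sketch using a made-up 2 x 2 table and the independence model. Note the assumptions: the counts are hypothetical, and the residual computed is the simpler standardized (Pearson) residual, (observed - expected) / sqrt(expected), not the adjusted residual that SPSS also reports. The adjusted version rescales by its estimated standard error, so treat this only as a rough illustration of the 1.50 screening rule.

```python
import math

def expected_independence(table):
    """Expected counts when rows and columns are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def likelihood_ratio_g2(observed, expected):
    """G2 = 2 * sum(observed * ln(observed / expected)); empty cells add zero."""
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(observed, expected)
                   for o, e in zip(obs_row, exp_row) if o > 0)

def standardized_residuals(observed, expected):
    """Pearson residuals: (observed - expected) / sqrt(expected)."""
    return [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(observed, expected)]

observed = [[35, 15], [15, 35]]   # hypothetical counts
expected = expected_independence(observed)
g2 = likelihood_ratio_g2(observed, expected)
residuals = standardized_residuals(observed, expected)

# Flag cells whose residual is larger than 1.50 in absolute value
flagged = [(i, j) for i, row in enumerate(residuals)
           for j, value in enumerate(row) if abs(value) > 1.5]

print(round(g2, 3), residuals, flagged)
```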

You will next examine A LOT of parameter estimates. It's not as bad as it looks, however, because redundant parameters, that could be calculated through a combination of marginal totals and independent parameters, have been omitted by setting them to zero.

Finally, you will see a gigantic table(s) that presents the covariances and correlations among the parameter estimates. This usually is not as interesting in terms of the substance of your results and we won't really examine these in this assignment.
 


YOUR BEST MODEL: USING THE SPSS GENERAL PROGRAM

This time, you will use the General program to build, test, and obtain the parameter estimates for your best-fitting model. Use your saturated model MODEL SELECTION and GENERAL runs to decide which terms to incorporate in your best-fitting model.

Once again, you will use the BIGDIGITALC8306.SAV file.

In this order, enter the variables:

reduc2     hihsmath     gender   homecomp

into the Factor(s): box.

Leave the distribution of cell counts on Poisson.

Click on the Model box.

HERE'S WHERE THE DIRECTIONS WILL CHANGE!

Click on Custom to move the radio button choice.

I suggest you start with the "Main effects" (single variable marginals) and work forward from there in constructing your model.

So, put the Build Term(s) box on "Main effects".
Click on the variables in the Factors & Covariates: box  to highlight them where you want to preserve the marginal distributions (typically, that is all of them unless your experimental or sampling design dictates otherwise).
Notice that you can click and highlight more than one variable at a time. (Hold down the "Ctrl" key on your keyboard to do this.)

So, for example, if you want to have all the single-variable effects in your model, you can click on all of them in the Factors & Covariates: box to highlight all the variables at once, then click the arrow button to pull all of the variables over into the Terms in Model: box.


In the Build Term(s) box decide what you will do next with the 2-way variable associations or correlation coefficients. If you want ALL the pairwise correlations, it's easiest just to highlight all the variables in the Factors & Covariates: box, select "All 2-way" in the Build Term(s) box and click the arrow button. The program will then place all possible two way associations in the Terms in Model: box. By extension, you can also build in all 3-way interactions, 4-way interactions, and so forth. Of course, this would simply rebuild the saturated model if you included all the possible terms (in this case, up to the 4-way interaction).
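If you would like to see in advance what "All 2-way" or "All 3-way" will paste into the Terms in Model: box, this short Python sketch lists the terms for the four variables in this exercise. The function name is hypothetical, and the order of the variable names inside a term may differ from what SPSS displays.

```python
from itertools import combinations

FACTORS = ["reduc2", "hihsmath", "gender", "homecomp"]

def all_k_way(factors, k):
    """Every k-way term that can be formed from the factor list."""
    return ["*".join(combo) for combo in combinations(factors, k)]

two_way = all_k_way(FACTORS, 2)
three_way = all_k_way(FACTORS, 3)
every_term = [term for k in range(1, len(FACTORS) + 1)
              for term in all_k_way(FACTORS, k)]

print(len(two_way), two_way)
print(len(three_way), three_way)
print(len(every_term))   # main effects through the 4-way term
```

With four variables there are 4 main effects, 6 two-way terms, 4 three-way terms and 1 four-way term, so including everything gives the 15 terms (plus the constant) of the saturated model.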


However, your best model may not be the saturated model. In fact, in this particular example, your best model probably will be simpler, even if just a little simpler. Typically, three-way interactions are relatively rare and 4-way or higher more unusual still.

To include ONLY the 2-way interactions you want, change the box to read "Interaction." Highlight the variable pairs in the Factors & Covariates: box two at a time, then click on the arrow to put the correlations in the Terms in Model: box. For example, if you wanted to include the gender*homecomp correlation, highlight both gender and homecomp, then click the arrow button. The 2-way association will appear in the Terms in Model: box. Continue entering the two-way terms until you have entered all the ones that you want to include in your best-fitting model.


To include ONLY the 3-way interactions you want, keep the box on "Interaction" and highlight the variable triplets in the Factors & Covariates: box three at a time, then click on the arrow to put the triplets in the Terms in Model: box. For example, if I wanted to include the 3-way interaction gender*reduc2*homecomp, I would highlight gender, reduc2 and homecomp, then click the arrow button. The program will now paste the gender*reduc2*homecomp term into the Terms in Model: box.


Continue pasting the terms you want into the Terms in Model: box until you are satisfied you have included all the terms in the model that you want. Remember to include lower order terms if appropriate. Then click the Continue box.


NOTE: You may find it easier to use the expressions "All 2-way", "All 3-way" (etc.), highlight all the variables and let the program build the terms for you. Then just use the back arrow to omit the terms that you don't want to use in the model.


Click on the Options... box.
Click to put a check mark on the Estimates box.
Notice that you can now obtain the adjusted residuals and their corresponding normal scores.

Since you probably will not have a saturated model, keep these check marks (although they add to an already lengthy output) because if your new model doesn't fit, the normal residuals in particular will give you clues to what your results should look like. Also you will need them for the Exercise 3 questions.

Click on the Continue box.
 

NOW click OK.

OUTPUT: THE SPSS GENERAL PROGRAM RUN--"BEST MODEL"

Once again, you have a lot of output. Double check the variable categories, the n, and other data information. Double check the model to ensure that you included ALL the terms that you wanted.

Check to be sure that the function converged in your model. The program initially allows 20 iterations of the function. Under Options...you can raise that number if you need to. I have almost never seen that to be necessary. However, if the estimates did not converge for your particular model, you will have strange and unreliable output. So give a quick glance at the convergence box to make sure before you continue. If the number of iterations is less than or equal to 20, you are fine.

Check the Goodness-of-Fit Tests box for the Chi-square, degrees of freedom and probability level for your best-fitting model. You will use these results to justify why you chose the model that you did.

Chi-square, degrees of freedom, and probability levels are never negative (they can be zero, as in the saturated model). If any of these quantities are negative, something is wrong!

The Adjusted Residuals which compare the expected and observed frequencies for each cell in the table are also helpful in selecting a final model.

The program gives you the parameter estimates for your model. Notice that only the terms you included when you created your custom model box are shown for the parameter estimates.

Once again, the program generates the correlations and covariances among all the parameters in the model you created.

ASSIGNMENT QUESTIONS 

NOTE: Several questions have more than one part. Be sure to answer all parts of each question.

1. Your SPSS FREQUENCIES output and your MODEL SELECTION AND GENERAL output (2 points)

Although your output does not have a large weight, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output.

PLUS YOUR ANSWERS TO QUESTIONS 2 - 9 BELOW:

2. (3 points) Based on the saturated model MODEL SELECTION output, which marginal, 2-way, 3-way, or 4-way terms look like they can be dropped?
Why?

3. (2 points) Briefly use loglinear terminology to describe your best-fitting model.

After you have written the abbreviation for your best-fitting model, please be sure to include all the lower order terms that are needed so that the model is a hierarchical one.

4. (2 points) How many degrees of freedom are in this best-fitting model?
What was the G2  and the associated p-level for the model you selected?

5. (3 points) Do you consider your final model "overfitted," "underfitted" or "just right"?
Briefly defend your choice.

6. (3 points) Using your GENERAL results, write out the loglinear equation WITH NUMBERS that corresponds to the model you believe has the best fit.

Be sure to label the variables in your equation. You can assign them the letters A, B, C and D as long as you provide the variable names that accompany each of the letters. You can also assign the variables descriptive letters, e.g., G, M, ED or PC.

7. (2 points) What was the largest residual in your best fitting model (i.e., which cell in the table did this residual correspond to?)
What was the size of its associated standardized or normal residual?
Given the size of the normal residual, was this further evidence about the fit of your model?

8. (1 point) Using your results from your best fitting model, draw a brief causal diagram sketch of how you think the variables Reduc2, Gender, Hihsmath and owning a PC (homecomp) all work together. (Note: there may be some assumptions about your best-fitting model buried here. See if you can catch them!)

9. (2 points)  IN WORDS, briefly describe the results as implied by your best fitting model. This means discussing the associations and possible interactions among the variables, not presenting numeric loglinear results or symbols.

Imagine that you are describing the results in a non-technical fashion to a colleague at a conference who is not familiar with categorical data analysis.
 
 
 

Susan Carol Losh
March 25 2009