THIS EXERCISE IS DUE BY CLASS THURSDAY APRIL 2.
|
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 3: THE SPSS GENERAL LOGLINEAR PROGRAM
20 points
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
|
You will use the SPSS Loglinear Model Selection and General programs to:
(A) run the saturated model for the four-way table of (1) reduc2 by (2) gender by (3) hihsmath (2 levels of high school math achievement--read on!) by (4) homecomp (owns a home PC), then
(B) run and test the model you believe has the best fit.
You will write out the loglinear equation for your best-fitting model, then describe your model in words suitable (say) for a conference presentation.
|
|
I RECOMMEND THE FOLLOWING TO MAKE YOUR VARIABLE CHOICES EASIER TO FIND. You really DO want to know what the numeric values are for the categories in each variable because this could influence any recoding decisions that you make. Changing the Data General defaults means that you often can find your variables for the statistical maneuvers much faster. In the SPSS menu at the top of the screen, go to the “Options” choice under the “Edit” menu. |
Keep your model selection program output in front of you as you use the
GENERAL program. It's easier to use the copious output about models and
partial associations in the MODEL SELECTION package to identify your best
model than to tediously try them out one at a time in the GENERAL program.
You will need the bigdigitalC8306.sav data file for Exercise 3.
|
|
Your GENERAL computer program output will look different from the MODEL SELECTION output--although the conclusions about your final model should remain about the same. However, the values of the estimators will be different:
First, you are now including the estimate of the constant term in your equation with the GENERAL program.
Second, the MODEL SELECTION program uses "deviation estimates" or contrast coding (1, 0 and -1). The default for dichotomous variables in this case is 1, -1 (NOT 1, 0). The GENERAL program uses "indicator estimates" or dummy coding (0, 1).
Third, check the variable codings very carefully to see which is the referent category and which is the omitted category. Make sure the variables are coded the way you wanted them to be.
Fourth, both programs add a small amount (typically 0.5) to each cell for the saturated model. This will change the expected frequencies at least slightly.
Fifth, the algorithms used to calculate the coefficients and model are slightly different for each program (IPF* for MODEL SELECTION and Newton-Raphson for GENERAL).
*IPF = Iterative proportional fitting algorithm
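The second point can be checked by hand on a small table. The sketch below is an illustration only, not SPSS output: it takes the 2 x 2 table of home computer by math level among respondents with at least a BA (counts from the table in this exercise, collapsed over gender) and computes the saturated-model parameters under both coding schemes. It assumes that the last category is the reference category under indicator coding, and it ignores the 0.5 added to each cell. The function names are made up for the example.

```python
import math

# Observed counts for respondents with at least a BA, collapsed over gender.
# Rows: has home computer, everyone else. Columns: 2+ years algebra, 1 year or less.
counts = [[1256, 112], [497, 75]]

def log_table(table):
    return [[math.log(n) for n in row] for row in table]

def deviation_params(table):
    """Saturated-model parameters for a 2 x 2 table under deviation (1, -1) coding."""
    logs = log_table(table)
    constant = sum(sum(row) for row in logs) / 4          # grand mean of the log counts
    row_effect = sum(logs[0]) / 2 - constant              # first row category
    col_effect = (logs[0][0] + logs[1][0]) / 2 - constant # first column category
    interaction = logs[0][0] - constant - row_effect - col_effect
    return {"constant": constant, "row": row_effect,
            "column": col_effect, "interaction": interaction}

def indicator_params(table):
    """Same model under indicator (0, 1) coding.
    Assumption: the LAST category of each variable is the reference category."""
    logs = log_table(table)
    constant = logs[1][1]                                 # log count of the reference cell
    row_effect = logs[0][1] - constant
    col_effect = logs[1][0] - constant
    interaction = logs[0][0] - constant - row_effect - col_effect
    return {"constant": constant, "row": row_effect,
            "column": col_effect, "interaction": interaction}

deviation = deviation_params(counts)
indicator = indicator_params(counts)
for name in deviation:
    print(f"{name:12s} deviation = {deviation[name]: .4f}   indicator = {indicator[name]: .4f}")
```

For a 2 x 2 table, the indicator-coded interaction is the log odds ratio and is four times the deviation-coded interaction, which is one reason the two programs report different numbers for the same fitted model.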
Because of the constraints that lambda and beta coefficients sum to zero within a particular variable, or across variable combinations (e.g., Reduc2-Homecomp), parameter estimator coefficients are not independent of each other. This will be especially true when you use the multinomial distribution, which assumes that n is fixed at the sample casebase (although with a saturated model, don't expect differences between multinomial and Poisson). MODEL SELECTION simply omitted the extraneous estimator coefficients entirely, trusting that you would know how to calculate them (particularly easy with contrast coding and dichotomous variables--these tend to be mirror images). GENERAL lists the parameters, but places zeros on the parameters that are not estimated independently of one another.
However, you can still use the Z-scores to guide you about which interaction terms or pairwise correlations you can drop.
When you create your GENERAL program, be sure to add the lower order terms to your program model that are hierarchically nested within the higher order interactions. The GENERAL program uses a hierarchical algorithm to calculate the G2 statistic and degrees of freedom. However, the parameter estimates change dramatically (and perhaps nonsensically) depending on which terms you include or omit while model building. The parameter estimates do not behave in a hierarchical fashion in the General program.
It is, of course, a different case if you have definite grounds to omit certain lower order terms. Perhaps you have an experimental design or a disproportionate sampling design that will automatically make some lower order terms or pairwise correlations to be zero. Since these effects really ARE zero, however, doing so shouldn't change your estimates of the other parameters.
However, notice that nuances such as sampling design differ from partial effects or correlations that become zero when you control for other variables! Remember that it is often necessary to include "main effects" when you have higher order interactions, whether you are doing an Analysis of Variance or a Loglinear Analysis. Not to do so is to often credit the interaction effects with more influence than they actually have.
If a partial effect (e.g., the gender marginal) becomes zero when other variables are controlled, you should probably include this term when you create your loglinear equation.
But do remember the possibility of changing the probability level to include or omit a lambda or beta parameter, just as you can change entry criteria with the Logistic Regression package. You can also set the confidence interval to 0.80 instead of 0.95. This will generate a narrower confidence interval than the default 0.95 confidence interval. If the narrower 0.80 interval does not contain zero, you should probably retain that parameter in the model.
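To see why the 0.80 interval is narrower, here is a minimal sketch. It assumes the intervals are the usual large-sample ones (estimate plus or minus z times the standard error). The estimate and standard error are hypothetical numbers chosen for illustration; they are not results from the exercise data.

```python
from statistics import NormalDist

def wald_interval(estimate, std_error, level):
    """Large-sample interval: estimate +/- z * SE (an assumed form, see lead-in)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical parameter estimate and standard error, NOT from the exercise data.
estimate, std_error = 0.20, 0.13

ci80 = wald_interval(estimate, std_error, 0.80)
ci95 = wald_interval(estimate, std_error, 0.95)
print("80% interval:", tuple(round(x, 3) for x in ci80))
print("95% interval:", tuple(round(x, 3) for x in ci95))
```

With these made-up numbers the 0.80 interval excludes zero while the 0.95 interval includes it, so the looser criterion would keep a parameter that the stricter one would drop.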
|
|
This exercise uses results from five of the National Science Foundation surveys (1990, 1995, 1997, 1999, and 2006), for a total of 8772 weighted cases. What happened to the other cases? Well, NSF didn't ask about high school math achievement until 1990 but it didn't include the computer questions in 1992. The 2002 data are from the General Social Survey (telephone subsample) before NSF routed the NSF Surveys of Public Understanding of Science and Technology through there; the GSS only collects detailed data on educational achievement when it collects the NSF Surveys module. By 2006, the NSF Surveys were being collected through the General Social Survey.
The four variable table below is a new one. It substitutes the highest level of high school math for Hispanic background (in two categories, 2 years of high school algebra or more versus 1 year of algebra or less; this will necessitate a recode into a new variable, see below). You'll want to study this table and maybe do some crosstabulation runs and/or calculate some percentages to see how pairwise correlations and/or interaction effects may operate because you will need to summarize your results in words again at the end of exercise 3.
AT LEAST A BA DEGREE
| MATH LEVEL | 2 YEARS H.S. ALGEBRA OR MORE | | | 1 YEAR H.S. ALGEBRA OR LESS | | |
| GENDER | MALE | FEMALE | BOTH (n) | MALE | FEMALE | BOTH (n) |
| HAS HOME COMPUTER | 73.9% | 69.1% | 1256 | 60.7% | 59.2% | 112 |
| EVERYONE ELSE | 26.1% | 30.9% | 497 | 39.3% | 40.8% | 75 |
| TOTAL | 100.0% (n = 942) | 100.0% (n = 811) | 1753 | 100.0% (n = 84) | 100.0% (n = 103) | 187 |
JUNIOR COLLEGE OR LESS
| MATH LEVEL | 2 YEARS H.S. ALGEBRA OR MORE | | | 1 YEAR H.S. ALGEBRA OR LESS | | |
| GENDER | MALE | FEMALE | BOTH (n) | MALE | FEMALE | BOTH (n) |
| HAS HOME COMPUTER | 49.4% | 46.0% | 1829 | 31.0% | 30.7% | 922 |
| EVERYONE ELSE | 50.6% | 54.0% | 2014 | 69.0% | 69.3% | 2067 |
| TOTAL | 100.0% (n = 1827) | 100.0% (n = 2016) | 3843 | 100.0% (n = 1273) | 100.0% (n = 1716) | 2989 |
Source: NSF Surveys of Public Understanding of Science and Technology, 1990, 1995, 1997, 1999 and 2006, Directors: Jon D. Miller and Linda Kimmel; Opinion Research Corporation/MACRO, General Social Survey; available n = 8772 (weighted data).
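One way to study the table before modeling is to compute percentages and odds ratios by hand. The sketch below uses only the counts printed in the table that combine both genders (has home computer versus everyone else, by math level, within each degree group). The dictionary layout and function names are my own for this illustration.

```python
# counts[degree][math level] = (has home computer, everyone else), collapsed over gender
counts = {
    "BA or more": {
        "2+ years algebra": (1256, 497),
        "1 year or less": (112, 75),
    },
    "Junior college or less": {
        "2+ years algebra": (1829, 2014),
        "1 year or less": (922, 2067),
    },
}

def percent_with_pc(has_pc, no_pc):
    """Percent of the group that has a home computer."""
    return 100 * has_pc / (has_pc + no_pc)

def odds_ratio(high_math, low_math):
    """Odds of owning a home computer, higher math group relative to lower math group."""
    return (high_math[0] / high_math[1]) / (low_math[0] / low_math[1])

total_n = sum(sum(cell) for by_math in counts.values() for cell in by_math.values())
odds_ratios = {}
for degree, by_math in counts.items():
    for math_level, cell in by_math.items():
        print(f"{degree:24s} {math_level:18s} {percent_with_pc(*cell):5.1f}% have a home computer")
    odds_ratios[degree] = odds_ratio(by_math["2+ years algebra"], by_math["1 year or less"])
    print(f"{degree:24s} math-by-computer odds ratio = {odds_ratios[degree]:.2f}")
print("Total weighted n =", total_n)
```

If the two odds ratios were very different, that would hint at a three-way interaction among degree, math level, and home computer; whether the difference here is large enough to matter is what the loglinear tests are for.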
|
|
Load your copy of the BIGDIGITALC8306.sav file into the SPSS Data Editor.
The first step is to collapse the highest level of high school math into two categories: "2 years of algebra or more" (category 1) and "1 year of algebra or less" (category 2).
Find the himath variable and click on the top of the himath column so that the column is highlighted.
On the top menu, click on the add variable icon.
SPSS will add a column headed "var00001" to the left of the himath variable.
Relabel var00001 as "hihsmath". (You can use the Define Variable Properties menu under the SPSS top Data menu.)
Under Transform in the top SPSS menu, click on "Recode into Different Variables".
The Input Variable is "himath".
The Output Variable (under Name) is "hihsmath". (Click on "Change".)
Then click on "Old and New Values".
Click on "Range".
The first range goes from "4" through "9". Make the new value "1" and click "Add" in the large box.
The second range goes from "0" through "3". Give it the new value "2" and click "Add".
Then click "Continue".
Then click OK.
Click on the "hihsmath" data column to highlight the column.
Under Data in the top SPSS menu, click "Define Variable Properties".
Add value labels (under Labels) if you like.
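The recode above amounts to a simple mapping, sketched here outside SPSS so you can check your logic. The function name and the sample values are hypothetical; the sketch treats anything outside the two ranges as missing, which matches what Recode into Different Variables does by default with values you did not list.

```python
def recode_himath(himath):
    """Collapse himath into hihsmath: 1 = 2 years of algebra or more, 2 = 1 year or less."""
    if himath is None:
        return None          # already missing
    if 4 <= himath <= 9:
        return 1             # 2 years of high school algebra or more
    if 0 <= himath <= 3:
        return 2             # 1 year of high school algebra or less
    return None              # unlisted values become missing

# Hypothetical raw codes, including a stray value (99) and a missing case.
sample = [0, 3, 4, 9, 99, None]
recoded = [recode_himath(value) for value in sample]
print(list(zip(sample, recoded)))
```

Running frequencies on both himath and hihsmath afterwards is the practical check that no missing-data code slipped into either range.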
|
In the MODEL SELECTION program, run the following (add in order) under the saturated model:
Reduc2, Hihsmath, Gender and Homecomp
Each variable has the values 1 and 2.
Request the parameter estimates and the partial association table.
Change the probability level on the opening Model Selection menu to 0.20 if you like.
Print the output and examine the models.
Still in MODEL SELECTION, try out and test your best model(s).
Under "Model", change the option from Saturated to Custom.
Since MODEL SELECTION is a hierarchical program, you only need to include the higher order terms for this program.
Test your best model and note the G2, df and significance level.
If you like, try out a few other models too. (Maybe your first one wasn't "the best" after all...)
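For intuition about what a hierarchical program does with a custom model, here is a minimal sketch of iterative proportional fitting and the G2 statistic. It is not SPSS's implementation. To stay small it uses a three-way table of degree by math level by home computer, built from the both-gender counts in the table above, and fits the model with all three two-way associations but no three-way interaction. All names are my own.

```python
import math

# Observed counts indexed (degree, math, pc), collapsed over gender.
# degree: 0 = BA or more, 1 = junior college or less
# math:   0 = 2+ years algebra, 1 = 1 year or less
# pc:     0 = has home computer, 1 = everyone else
observed = {
    (0, 0, 0): 1256, (0, 0, 1): 497,
    (0, 1, 0): 112,  (0, 1, 1): 75,
    (1, 0, 0): 1829, (1, 0, 1): 2014,
    (1, 1, 0): 922,  (1, 1, 1): 2067,
}

def margin(table, axes):
    """Sum the table over every variable not listed in axes."""
    totals = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0.0) + value
    return totals

def ipf(observed, generating_class, tol=1e-8, max_iter=500):
    """Rescale fitted counts until they reproduce each margin in the generating class."""
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(max_iter):
        biggest_change = 0.0
        for axes in generating_class:
            target = margin(observed, axes)
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                new_value = fitted[cell] * target[key] / current[key]
                biggest_change = max(biggest_change, abs(new_value - fitted[cell]))
                fitted[cell] = new_value
        if biggest_change < tol:
            break
    return fitted

def g_squared(observed, fitted):
    """Likelihood-ratio chi-square: 2 * sum of observed * ln(observed / fitted)."""
    return 2 * sum(n * math.log(n / fitted[cell]) for cell, n in observed.items())

# Only the highest-order terms are listed; the lower-order margins are implied.
generating_class = [(0, 1), (0, 2), (1, 2)]
fitted = ipf(observed, generating_class)
g2 = g_squared(observed, fitted)
df = 1  # 8 cells minus 7 parameters (constant, 3 main effects, 3 two-way terms)
print(f"G2 = {g2:.3f} on {df} df")
```

Listing only the two-way margins is enough because matching a two-way margin automatically matches the one-way margins inside it, which is the sense in which MODEL SELECTION needs only the higher order terms.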
Under the SPSS Analyze menu, go to the Loglinear section and click on General...
In this order, enter the variables:
reduc2 hihsmath gender homecomp
into the Factor(s): box.
Leave the distribution of cell counts on Poisson (for all dichotomous variables and a fully saturated model, your conclusions will essentially be the same as if you had used the Multinomial distribution).
Click on the Model... box.
Leave the radio button choice on Saturated for this first run experience, then click the Continue box.
Click on the Options... box.
Click to put a check mark on the Estimates box.
GENERAL allows you to obtain the estimated parameters for any model, saturated or not.
If you like: Change the Confidence Interval: box to 80 (%)
Click on the Continue box.
Then click OK.
|
|
You will have A LOT of output.
Depending on what you are interested in, you may wish to delete some of it (not for this assignment, however) when you work with loglinear models in data analysis.
First, you receive a lot of information about your program run. Always check the model in the GENERAL program (the saturated model should be OK but it's a good habit to cultivate with any of these programs) to make sure that you included the terms in the model that you intended to include. You will find the model at the very beginning of the program output after the "Factors" list.
Make sure that the casebase (including the weighted casebase, which is used in the analysis), the variables, and the way the variables are coded correspond to what you believe you had entered into the analysis.
For example, make sure the number of categories (levels) for each variable corresponds to your knowledge of that variable. Too many categories could mean that you inadvertently included cases that had missing data codes or even "wild punches" as substantively real values. If so, you would want to use SPSS provisions to make sure these category values are correctly coded as missing the next time you analyze the data. (Did you run the frequencies on each of your four variables first? Especially "himath" and the recoded "hihsmath".)
For your saturated model, the Chi-square should be zero with zero degrees of freedom. The "statistical significance" level is considered undefined in this case so you will simply see a dot under significance.
Next, you receive the observed and model cell counts for each of the 16 cells in the table.
In a saturated model, the observed and expected cell counts will be identical. However, once you drop terms, these will diverge. In models that are not saturated, you will want to examine the cell count residuals, and especially the adjusted cell count residuals, for clues about where the model may not fit. Larger adjusted residuals (over 1.50) should alert you to possible terms that need to be placed back in the model so that the model accurately reflects your observed data table.
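As a rough illustration of adjusted residuals, the sketch below fits the simplest unsaturated model, independence, to the 2 x 2 table of home computer by math level among respondents with at least a BA (both genders combined). The closed-form adjusted residual used here is the textbook one for independence in a two-way table; GENERAL computes adjusted residuals for whatever model you fit, and the general formula is more involved. The function name is made up for the example.

```python
import math

# Rows: has home computer, everyone else. Columns: 2+ years algebra, 1 year or less.
observed = [[1256, 112], [497, 75]]

def adjusted_residuals(table):
    """Adjusted residuals under independence for a two-way table."""
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(row[j] for row in table) / n for j in range(len(table[0]))]
    result = []
    for i, row in enumerate(table):
        out = []
        for j, count in enumerate(row):
            expected = n * row_p[i] * col_p[j]
            std_error = math.sqrt(expected * (1 - row_p[i]) * (1 - col_p[j]))
            out.append((count - expected) / std_error)
        result.append(out)
    return result

residuals = adjusted_residuals(observed)
for row in residuals:
    print([round(value, 2) for value in row])
flagged = [value for row in residuals for value in row if abs(value) > 1.5]
print("Cells over the 1.50 guideline:", len(flagged))
```

In a 2 x 2 table every cell has the same adjusted residual in absolute value, so all four cells are flagged together; here that signals that the home computer by math association belongs back in the model.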
You will next examine A LOT of parameter estimates. It's not as bad as it looks, however, because redundant parameters, that could be calculated through a combination of marginal totals and independent parameters, have been omitted by setting them to zero.
Finally, you will see one or more gigantic tables that present the covariances and correlations among the parameter estimates. These usually are not as interesting in terms of the substance of your results, and we won't really examine them in this assignment.
|
|
This time, you will use the General program to build, test, and obtain the parameter estimates for your best-fitting model. Use your saturated model MODEL SELECTION and GENERAL runs to decide which terms to incorporate in your best-fitting model.
Once again, you will use the BIGDIGITALC8306.SAV file.
In this order, enter the variables:
reduc2 hihsmath gender homecomp
into the Factor(s): box.
Leave the distribution of cell counts on Poisson.
Click on the Model box.
HERE'S WHERE THE DIRECTIONS WILL CHANGE!
Click on Custom to move the radio button choice.
I suggest you start with the "Main effects" (single variable marginals) and work forward from there in constructing your model.
So, put the Build Term(s) box on "Main effects".
Click on the variables in the Factors & Covariates: box to highlight those whose marginal distributions you want to preserve (typically, that is all of them unless your experimental or sampling design dictates otherwise).
Notice that you can click and highlight more than one variable at a time. (Hold down the "Ctrl" key on your keyboard to do this.)
So, for example, if you want to have all the single-variable effects in your model, you can click on all of them in the Factors & Covariates: box to highlight all the variables at once, then click the arrow button to pull all of the variables over into the Terms in Model: box.
In the Build Term(s) box, decide what you will do next with the 2-way variable associations or correlation coefficients. If you want ALL the pairwise correlations, it's easiest just to highlight all the variables in the Factors & Covariates: box, select "All 2-way" in the Build Term(s) box and click the arrow button. The program will then place all possible two-way associations in the Terms in Model: box. By extension, you can also build in all 3-way interactions, 4-way interactions, and so forth. Of course, this would simply rebuild the saturated model if you included all the possible terms (in this case, up to the 4-way interaction).
However, your best model may not be the saturated model. In fact, in this particular example, your best model probably will be simpler, even if just a little simpler. Typically, three-way interactions are relatively rare and 4-way or higher more unusual still.
To include ONLY the 2-way interactions you want, change the box to read "Interaction." Highlight the variable pairs in the Factors & Covariates: box two at a time, then click on the arrow to put the correlations in the Terms in Model: box. For example, if you wanted to include the gender*homecomp correlation, highlight both gender and homecomp, then click the arrow button. The 2-way association will appear in the Terms in Model: box.
Continue entering the two-way terms until you have entered all the ones that you want to include in your best-fitting model.
To include ONLY the 3-way interactions you want, keep the box on "Interaction" and highlight the variable triplets in the Factors & Covariates: box three at a time, then click on the arrow to put the triplets in the Terms in Model: box.
For example, if I wanted to include the 3-way interaction gender*reduc2*homecomp, I would highlight gender, reduc2 and homecomp, then click the arrow button. The program will now paste the gender*reduc2*homecomp term into the Terms in Model: box.
Continue pasting the terms you want into the Terms in Model: box until you are satisfied you have included all the terms in the model that you want. Remember to include lower order terms if appropriate. Then click the Continue box.
NOTE: You may find it easier to use the expressions "All 2-way", "All 3-way" (etc.), highlight all the variables and let the program build the terms for you. Then just use the back arrow to omit the terms that you don't want to use in the model.
Click on the Options... box.
Click to put a check mark on the Estimates box.
Notice that you can now obtain the adjusted residuals and their corresponding normal scores.
Since you probably will not have a saturated model, keep these check marks (although they add to an already lengthy output) because if your new model doesn't fit, the normal residuals in particular will give you clues to what your results should look like. Also you will need them for the Exercise 3 questions.
Click on the Continue box.
NOW click OK.
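Because GENERAL needs every lower order term spelled out, it can help to list them before you start clicking. The sketch below expands a set of highest-order terms into the full hierarchical list and counts the residual degrees of freedom, assuming all four variables are dichotomous so each term uses one parameter. The example model is hypothetical, chosen only to show the bookkeeping; it is not a recommended answer.

```python
from itertools import combinations

FACTORS = ["reduc2", "hihsmath", "gender", "homecomp"]  # all dichotomous

def expand_hierarchical(generating_class):
    """Return every term nested within the listed highest-order terms."""
    terms = set()
    for term in generating_class:
        for size in range(1, len(term) + 1):
            for subset in combinations(term, size):
                terms.add(tuple(sorted(subset, key=FACTORS.index)))
    return sorted(terms, key=lambda t: (len(t), [FACTORS.index(v) for v in t]))

def residual_df(terms, n_cells=16):
    """Cells minus the constant minus one parameter per term (dichotomous variables only)."""
    return n_cells - 1 - len(terms)

# Hypothetical generating class, for illustration only.
example = [("reduc2", "hihsmath", "homecomp"), ("gender", "homecomp")]
terms = expand_hierarchical(example)
for term in terms:
    print("*".join(term))
print("Residual df:", residual_df(terms))
```

The printed list is what would need to appear in the Terms in Model: box for this hypothetical model, and the residual df should match the df GENERAL reports in the Goodness-of-Fit Tests box.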
|
|
Once again, you have a lot of output. Double check the variable categories, the n, and other data information. Double check the model to ensure that you included ALL the terms that you wanted.
Check to be sure that the function converged in your model. The program initially allows 20 iterations of the function. Under Options...you can raise that number if you need to. I have almost never seen that to be necessary. However, if the estimates did not converge for your particular model, you will have strange and unreliable output. So give a quick glance at the convergence box to make sure before you continue. If the number of iterations is less than or equal to 20, you are fine.
Check the Goodness-of-Fit Tests box for the Chi-square, degrees of freedom and probability level for your best-fitting model. You will use these results to justify why you chose the model that you did.
Chi-square, degrees of freedom, and probability levels are never negative (the saturated model has a Chi-square of zero with zero degrees of freedom). If any of these quantities are negative, something is wrong!
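The probability level in the Goodness-of-Fit Tests box is the upper-tail chi-square probability for the reported statistic and degrees of freedom. If you want to double check a value by hand, the sketch below computes it for whole-number degrees of freedom using only the Python standard library. The function name is my own, and the test values are standard chi-square critical values, not results from this exercise.

```python
import math

def chi_square_p(statistic, df):
    """Upper-tail probability of a chi-square statistic for integer df >= 1."""
    if statistic < 0 or df < 1:
        raise ValueError("statistic must be >= 0 and df must be >= 1")
    y = statistic / 2
    # Start from a closed form: df = 1 (odd case) or df = 2 (even case).
    if df % 2 == 0:
        a, p = 1.0, math.exp(-y)
    else:
        a, p = 0.5, math.erfc(math.sqrt(y))
    # Step up two degrees of freedom at a time until a equals df / 2.
    while a < df / 2 - 1e-9:
        p += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return p

# Standard 0.05 critical values for 1, 2 and 3 degrees of freedom.
for stat, df in [(3.841, 1), (5.991, 2), (7.815, 3)]:
    print(f"chi-square = {stat}, df = {df}, p = {chi_square_p(stat, df):.4f}")
```

A model that fits well has a small G2 relative to its degrees of freedom and therefore a large p-level, which is the reverse of the usual hunt for p below 0.05.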
The Adjusted Residuals which compare the expected and observed frequencies for each cell in the table are also helpful in selecting a final model.
The program gives you the parameter estimates for your model. Notice that only the terms you included when you created your custom model box are shown for the parameter estimates.
Once again, the program generates the correlations and covariances among all the parameters in the model you created.
|
|
NOTE: The little red balls are scattered throughout to help you remember to answer all parts of each question.
1. Your SPSS FREQUENCIES output and your MODEL SELECTION and GENERAL output (2 points)
Although your output does not have a large weight, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output.
PLUS YOUR ANSWERS TO QUESTIONS 2 - 9 BELOW:
2. (3 points)
Based on the saturated model MODEL SELECTION output, which marginal, 2-way, 3-way, or 4-way terms look like they can be dropped?
Why?
3. (2 points)
Briefly use loglinear terminology to describe your best-fitting model.
After you have written the abbreviation for your best-fitting model, please be sure to include all the lower order terms that are needed so that the model is a hierarchical one.
4. (2 points)
How many degrees of freedom are in this best-fitting model?
What was the G2 and the associated p-level for the model you selected?
5. (3 points)
Do you consider your final model "overfitted," "underfitted," or "just right"?
Briefly defend your choice.
6. (3 points)
Using your GENERAL results, write out the loglinear equation WITH NUMBERS that corresponds to the model you believe has the best fit.
Be sure to label the variables in your equation. You can assign them the letters A, B, C and D as long as you provide the variable names that accompany each of the letters. You can also assign the variables descriptive letters, e.g., G, M, ED or PC.
7. (2 points)
What was the largest residual in your best-fitting model (i.e., which cell in the table did this residual correspond to)?
What was the size of its associated standardized or normal residual?
Given the size of the normal residual, was this further evidence about the fit of your model?
8. (1 point)
Using your results from your best-fitting model, draw a brief causal diagram sketch of how you think the variables Reduc2, Gender, Hihsmath and owning a PC (homecomp) all work together. (Note: there may be some assumptions about your best-fitting model buried here. See if you can catch them!)
9. (2 points)
IN WORDS, briefly describe the results as implied by your best-fitting model. This means discussing the associations and possible interactions among the variables, not presenting numeric loglinear results or symbols.
Imagine that you are describing the results in a non-technical fashion to a colleague at a conference who is not familiar with categorical data analysis.
Susan Carol Losh
March 25 2009