See sequential testing of the four variable model. Click HERE for the Word document

See updates for the future. Click HERE
 

 
 

EDF 6937-01       SPRING 2009
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 2 FEEDBACK
20 POINTS TOTAL
BASICS: TESTING HIERARCHICAL MODELS AND WRITING EQUATIONS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
 IN GENERAL
POINTS TO PONDER
SPECIFIC ANSWERS

Please read over this material very carefully. Remember that I do not answer questions about PERSONAL papers in class (including break). We can speak after class or in an appointment. Thanks!

This exercise provided practice with testing hierarchical models with the likelihood ratio G2 statistic and the SPSS "Model Selection" or "HILOG" program, as well as basic GCF loglinear equations. Again, we are starting out basic; each of the four variables in this model had two categories. We used the 6192 valid WEIGHTED cases from the 1999, 2002 and 2006 NSF Surveys of Public Understanding of Science and Technology "BigDigital" file which examines personal computer and Internet access and use. Here are the variables:

GENDER: 1 = Male  2 = Female

HISPANIC (or Latino background):  1= Yes  2 = No

REDUC2: Coded 1 = the respondent has at least a BA degree or 2 = the respondent has a junior college degree or less

HOMECOMP: Coded 1 = the respondent has a home personal computer or 2=s/he does not have a home computer

These codes are important. I chose--or in some cases reversed--the category values because the HILOG program counts the first category (whatever it is, including 0) as "High". Thus, if you found a negative λ coefficient between gender and education, this would mean that females (coded 2) had more education (coded 1) than males (coded 1).

I have reproduced the table again below with the calculation of an additional--but quite telling--percentage:
 
 

BACCALAUREATE OR MORE

                     HISPANIC BACKGROUND: YES           HISPANIC BACKGROUND: NO
GENDER               MALE      FEMALE    BOTH           MALE      FEMALE    BOTH
HAS HOME COMPUTER    92.9%     83.3%     61 (87%)       85.9%     81.3%     1199 (84%)
OTHER                 7.1%     16.7%      9             14.1%     18.7%      236
TOTAL               100.0%    100.0%                   100.0%    100.0%
n                    28        42        70             708       727       1435

JUNIOR COLLEGE OR LESS

                     HISPANIC BACKGROUND: YES           HISPANIC BACKGROUND: NO
GENDER               MALE      FEMALE    BOTH           MALE      FEMALE    BOTH
HAS HOME COMPUTER    49.5%     43.2%     192 (46%)      52.8%     53.9%     2282 (53%)
OTHER                50.5%     56.8%     224            47.2%     46.1%     1989
TOTAL               100.0%    100.0%                   100.0%    100.0%
n                    194       222       416            1874      2397      4271

Source: NSF Surveys of Public Understanding of Science and Technology, 1999, 2002 and 2006, Directors: Jon D. Miller and Linda Kimmel; Opinion Research Corporation/MACRO/NORC General Social Survey. available n = 6192 (weighted cases)

Why is the added % "telling"? It shows what the HISPANIC*EDUCATION*PC interaction looks like.

Before you do any more complicated runs, you need to look the 4-way table over and gain a sense of what it is saying:

For individuals with at least a baccalaureate, Hispanic background makes very little difference in owning a PC (87% versus 84%). However, among those with less education, Hispanics are 7 percentage points less likely than Non-Hispanics to own a PC (46% versus 53%).

Gender only makes a difference for Hispanics with at least a BA. This would imply a 4 variable interaction effect--except we all know that there was not a four variable interaction in these data.

In any event, your answer to question 10 had to tackle these findings in some way and the percentage tables are probably the easiest for a novice reader to grasp. You can also run the three-way tables using the SPSS Crosstabs routine.
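The three-way percentages discussed above can also be recomputed by hand. The following is a minimal sketch in Python (not part of the SPSS output); the counts are copied from the table, while the names `OWNERS`, `pc_percent` and `hispanic_gap` are made up for illustration.

```python
# (home PC owners, all respondents) for each education x Hispanic group,
# copied from the percentage table above.
OWNERS = {
    ("BA or more", "Hispanic"): (61, 70),
    ("BA or more", "Non-Hispanic"): (1199, 1435),
    ("Junior college or less", "Hispanic"): (192, 416),
    ("Junior college or less", "Non-Hispanic"): (2282, 4271),
}


def pc_percent(education, background):
    """Percent of the group that has a home computer."""
    owners, total = OWNERS[(education, background)]
    return 100.0 * owners / total


def hispanic_gap(education):
    """Non-Hispanic minus Hispanic ownership, in percentage points."""
    return pc_percent(education, "Non-Hispanic") - pc_percent(education, "Hispanic")


for education in ("BA or more", "Junior college or less"):
    print(f"{education}: gap = {hispanic_gap(education):+.1f} points")
```

The gap changes both size and direction across the two education levels, which is what a three-way interaction looks like when it is expressed in percentages.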

When you present papers at a conference or write an article, you need to tell your reader or listener what you found in words about correlations and interactions. (A lot of readers never read the tables.) It's not enough just to say: we have a three way interaction effect among Hispanic Background, degree level and owning a personal computer; we need to describe what this means as above.



POINTS TO PONDER

We are at a good pace and we are staying relatively close together in understanding the material! I'll go over the assignment and "sticky wickets" below.

Make sure you include the variable labels for each lambda coefficient. You need to do so, just as you do when you write out a multiple regression equation. Make sure you include ALL the λs. Omitting a λ or two was one of the most common errors (and misreading the output was a closely related error!)

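For example, in our four variable model, the saturated GCF equation is given below, where:

   E = degree level  H = Hispanic  G = gender  and PC = Owns a PC

$$G_{ijkl} = \theta + \lambda_i^{E} + \lambda_j^{H} + \lambda_k^{G} + \lambda_l^{PC} + \lambda_{ij}^{EH} + \lambda_{ik}^{EG} + \lambda_{il}^{EPC} + \lambda_{jk}^{HG} + \lambda_{jl}^{HPC} + \lambda_{kl}^{GPC} + \lambda_{ijk}^{EHG} + \lambda_{ijl}^{EHPC} + \lambda_{ikl}^{EGPC} + \lambda_{jkl}^{HGPC} + \lambda_{ijkl}^{EHGPC}$$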

or, in numbers (the question asked for the numbers in the saturated model):

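$$G_{ijkl} = \theta - .892\,\mathrm{E} - 1.363\,\mathrm{H} - .156\,\mathrm{G} + .447\,\mathrm{PC} - .201\,\mathrm{EH} - .063\,\mathrm{EG} + .450\,\mathrm{EPC} - .060\,\mathrm{HG} + .006\,\mathrm{HPC} + .092\,\mathrm{GPC} - .089\,\mathrm{EHG} + .076\,\mathrm{EHPC} + .067\,\mathrm{EGPC} + .055\,\mathrm{HGPC} + .019\,\mathrm{EHGPC}$$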

Be sure you are comfortable with the coefficients. Remember how they are coded! For example, since females are the second code "2" and the program treats the first code as "high," the -.156 G coefficient means there are somewhat more females (coded 2) than males (coded 1). The positive GPC coefficient means there are somewhat more males (coded 1) who own a home computer (coded 1) and more females (coded 2) who don't own a computer (coded 2). In other words, "high" gender goes with "high" PC ownership and the coefficient is positive. The 3-way and up interactions are trickier to interpret and that's when you need those percentage tables. One example is the positive education-gender-PC coefficient (+.067) meaning that males with at least a baccalaureate more often own a PC than females with at most a junior college degree.
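To see where coefficients like these come from, here is a rough sketch in Python of how the saturated-model lambdas can be computed from the logs of the cell counts when the first category of each variable is treated as "high" (+1) and the second as "low" (-1). Several things here are assumptions rather than facts from the HILOG output: the 16 cell counts are reconstructed by rounding from the percentages and column totals in the table (the actual weighted counts may be fractional), the constant `DELTA = 0.5` added to every cell is assumed to mirror the program's default for saturated models, and all names in the code are invented for illustration. The results should land close to, but not exactly on, the coefficients in the equation above.

```python
import math
from itertools import combinations

# Cell counts keyed by (E, H, G, PC) codes, each 1 or 2.
# ASSUMPTION: reconstructed from the table's percentages and column totals.
COUNTS = {
    (1, 1, 1, 1): 26,   (1, 1, 1, 2): 2,
    (1, 1, 2, 1): 35,   (1, 1, 2, 2): 7,
    (1, 2, 1, 1): 608,  (1, 2, 1, 2): 100,
    (1, 2, 2, 1): 591,  (1, 2, 2, 2): 136,
    (2, 1, 1, 1): 96,   (2, 1, 1, 2): 98,
    (2, 1, 2, 1): 96,   (2, 1, 2, 2): 126,
    (2, 2, 1, 1): 990,  (2, 2, 1, 2): 884,
    (2, 2, 2, 1): 1292, (2, 2, 2, 2): 1105,
}
VARS = ("E", "H", "G", "PC")
DELTA = 0.5  # ASSUMPTION: constant added to each cell before taking logs


def saturated_lambdas(counts, delta=DELTA):
    """Return one effect-coded lambda per term (15 terms for 4 variables)."""
    logs = {cell: math.log(n + delta) for cell, n in counts.items()}
    lambdas = {}
    for size in range(1, len(VARS) + 1):
        for idx in combinations(range(len(VARS)), size):
            total = 0.0
            for cell, log_n in logs.items():
                # Code 1 -> +1, code 2 -> -1, multiplied across the term.
                sign = 1
                for i in idx:
                    sign *= 1 if cell[i] == 1 else -1
                total += sign * log_n
            lambdas["".join(VARS[i] for i in idx)] = total / len(logs)
    return lambdas


LAMBDAS = saturated_lambdas(COUNTS)
for name in ("E", "H", "PC", "EPC", "EHGPC"):
    print(f"{name:6s} {LAMBDAS[name]:+.3f}")
```

Because every variable has two categories, each term has a single lambda, and its sign tells you whether the "high" (first) categories go together, exactly as described above.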

NOTE: You only need to go out to 3 decimal places. More than that makes your equations very difficult to read. The program will use the extended decimal places as needed.

NOTE: The label for each λ coefficient in the HILOG output is ABOVE the numeric coefficient (not below it).

COMMON NOVICE MISTAKES: Forgetting to include all the lower order terms when writing out the loglinear equation for a hierarchical model. Do get used to writing out all the lower order terms embedded in a hierarchical model. It will make running programs other than HILOG much easier.

Likewise, make sure you enter all the terms you planned for a non-hierarchical model.

This practice in writing out terms will help you later when you run other programs.
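If you want to check that you have written out every lower order term, the expansion can be done mechanically. This is only a sketch in Python, not something HILOG produces; the function name `implied_terms` and the way the model is written as a list of its highest order terms are choices made for this illustration.

```python
from itertools import combinations


def implied_terms(generators):
    """List every term embedded in a hierarchical model.

    `generators` holds the highest order terms, each a tuple of variable
    names given in one consistent order (here E, H, G, PC).
    """
    terms = set()
    for gen in generators:
        for size in range(1, len(gen) + 1):
            terms.update(combinations(gen, size))
    return sorted(terms, key=lambda t: (len(t), t))


# Example: a model whose highest order terms are three 3-way interactions.
MODEL = [("E", "H", "G"), ("E", "H", "PC"), ("E", "G", "PC")]
TERMS = implied_terms(MODEL)
for term in TERMS:
    print("*".join(term))
```

For this example the list contains the four single-variable terms, all six two-way associations, and the three 3-way interactions named in `MODEL`; the H*G*PC and four-way terms are absent because no generator contains them.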

NOTE: p is virtually never .0000! SPSS and other programs truncate the p level at a predetermined decimal point. When you see p = .000 this really means p < .001. p "=" .0000 means p < .0001.

INTERPRET THE K-WAY TABLE CORRECTLY:  In the first k-way table, you are told the G2 and probability levels for k and up associations and interactions being deleted from the model. For example, the line reading k=4 is the test for OMITTING the 4-way interaction term. The remaining model incorporates all 3-way interactions and lower-order terms.

If the Chi-square for the k=4 line is large and the probability level is small, the model does not fit. You must return the 4-way interaction to the model in order for it to fit the observed data correctly.

k =3 means that you deleted all the 3 way plus the 4 way interaction effects. If the Chi-square for the k = 3 line is large and the probability level is small, a model that omits all 3-ways and the 4-way interaction does not fit. If the p-level on the k=4 line is large, you can omit the 4-way interaction but must return at least one 3-way interaction to the model for it to fit. You may not need to return all the 3-way interactions to make the model fit (in this example, you could omit the Hispanic*Gender*Home Computer 3-way interaction.)

LARGE chi-squares mean that you omitted a vital lambda term that must be returned to the model.
SMALL chi-squares typically mean the model fits. Perhaps you can omit other terms as well.
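The degrees of freedom on each line of the k-way table follow directly from counting the terms being omitted. Here is a small sketch in Python, assuming (as in this exercise) that every variable has exactly two categories so that each term carries one degree of freedom; the function name is made up for illustration.

```python
from math import comb


def kway_and_higher_df(n_vars, k):
    """DF for deleting all k-way and higher terms, dichotomous variables only."""
    return sum(comb(n_vars, order) for order in range(k, n_vars + 1))


for k in (4, 3, 2, 1):
    print(f"k = {k}: DF = {kway_and_higher_df(4, k)}")
```

With four variables there is one 4-way term and four 3-way terms, so the k = 4 line has 1 degree of freedom and the k = 3 line has 5, matching the output discussed in this exercise.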
 
 


SPECIFIC ANSWERS TO SPECIFIC QUESTIONS

USE THE TABLE ABOVE TO HELP ANSWER THE FOLLOWING QUESTIONS.
PLEASE BE SURE TO ANSWER ALL DESIGNATED PARTS OF A QUESTION.

1. Your SPSS FREQUENCIES output and your MODEL SELECTION ("HILOG") output (2 points)

Although your output does not have a large weight, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output.

PLUS YOUR ANSWERS TO QUESTIONS 2 - 10 BELOW:

2. (2 points) What is the G2, degrees of freedom and p-level for the hierarchical model that incorporates all the three-way interaction effects?

This is the model which omits ONLY the four-way interaction term. All three-way effects (and lower order terms) are present. You can locate this set of results from the k = 4 line of the first k-way table and these are:

Likelihood ratio chi square =      0.289    DF = 1  P =  .591

3. (2 points) Do your results suggest that the four variable interaction term can be deleted from a well-fitting model?
BRIEFLY, give the rationale behind your decision.

Yes, the four-way interaction term can be deleted. The G2 is quite small (.289) and the probability of this model exceeds 0.20. The expected values from this all 3-way model are very close to the values in the observed four-way table.

4. (2 points) Based on the results, does it appear that any of the three-way interaction terms must be retained in order for the model to fit well?
If so, which interaction term(s) must be retained?
BRIEFLY give the rationale behind your decision.

The k = 3 line indicates that a model that deletes all 3-way effects and the 4-way interaction has a very large chi-square and a low probability level (G2  = 14.67   DF = 5  P = .012). This means that AT LEAST one of the 3-way interactions must be retained in order for the observed and expected values in the 4-way table to match within sampling error.  A careful look at the partial associations suggests that one possible candidate to drop would be the HISPANIC*GENDER*HOMECOMP interaction. In the partial associations table, this would generate a G2 = 2.56    DF = 1  P = .1099. However, this probability level is below our typical cutoff value of 0.20, indicating that a model omitting this three-way interaction could generate expected values that may be different from the observed results in the table.

There's only one way to find out! Go ahead and test that model. In fact I ran a few simpler models to try out (see the link below). Some of them flopped miserably although the initial diagnostic statistics suggested that they might work.

Finally I went ahead and ran a model deleting the 4-way interaction and the HISPANIC*GENDER*HOMECOMP 3-way interaction. That generated a G2 = 2.84    DF = 2  P = .241, which fits the data quite nicely.

It is easy to go ahead and test these models sequentially. Follow my testing in more detail HERE.
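If you want to double check a probability level from its G2 and degrees of freedom, the upper tail of the chi-square distribution can be computed directly. The sketch below is plain Python using the standard closed-form series for whole-number degrees of freedom; the function name `chi2_sf` and the list of models are set up for illustration, and a statistics package (for example `scipy.stats.chi2.sf`) would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square > x) for a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum of x^(r-1)/(1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total)


# (description, G2, DF) taken from the results reported in this feedback.
MODELS = [
    ("omit 4-way only (k = 4 line)", 0.289, 1),
    ("omit all 3-way and 4-way (k = 3 line)", 14.67, 5),
    ("omit 4-way and H*G*PC", 2.84, 2),
]
for label, g2, df in MODELS:
    p = chi2_sf(g2, df)
    verdict = "fits" if p > 0.20 else "does not fit"
    print(f"{label}: G2 = {g2}, DF = {df}, p = {p:.3f} ({verdict} at the 0.20 cutoff)")
```

The 0.20 cutoff in the last line is the in-class criterion mentioned above; the computed probabilities should agree with the reported ones to about three decimal places.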

The G2 associated with the k-way table essentially represents a type of "average" chi-square per degrees of freedom. Therefore, it is not necessarily the most accurate way to make decisions on any particular interaction effect or partial correlation. In conjunction with the partial associations and the λ parameter standardized Z-scores, however, the k-way table can be very helpful in making decisions about a final model.

Examining the λs brings some other insights. For example, they suggest we might want to re-estimate using a non-hierarchical model. There isn't much of a two-way association between H*G (z = -1.173) or H*PC (z = 0.110).

However, we need to consider all the information together, and at least sometimes, in light of the study design. HILOG runs hierarchical models, so we can't test dropping some of these coefficients. Gender and ethnicity were weighted to reflect the U.S. adult population, so the low GENDER * HISPANIC λ value is not surprising. A non-hierarchical model dropping this term could probably work.
 
 
 
NOVICE NOTE: Don't add and subtract the G2s in the partial tables with respect to each other. These are partial coefficients ONLY for that particular term that was omitted. They are NOT hierarchical G2s. The only way to know what the final G2 is, is to run the total model and take the Chi-square and other statistics from that model.

5. (2 points) Based on the results, does it appear that any of the two-way association terms can be dropped from the model and yet the model will still fit the data well?
If so, which two way association(s) could be dropped?
BRIEFLY give the rationale behind your decision.

Using the lambdas and Z-scores, the three possible candidates are: REDUC2*GENDER (Z = -1.23), HISPANIC*GENDER (Z =-1.17), and HISPANIC*HOMECOMP (Z = 0.11).

Using the partial association table:
Dropping the Education by Gender correlation results in a G2  = 8.63   DF = 1  P =  .003
Dropping the Hispanic by Gender correlation results in a G2 = 0.16   DF = 1  P =  .694
Dropping the Hispanic by Home PC correlation results in a   G2 = 6.22 DF = 1 P = .013.

The only consistent one that might be dropped is the Hispanic * Gender two-way (see above).

What about the others? Why the discrepancies? The E * G and H * PC associations are both buried in significant 3-way interactions. Just as dropping a lower order term in ANOVA can produce some strange results when an associated higher order interaction is statistically significant, we need to be careful shifting to a non-hierarchical model.
 


 6. (2 points) Using the Z-values associated with each parameter, which parameters look like they may be dropped and yet the model will still fit well? (A bit redundant at this point but here goes:)

E = Education H = Hispanic G = Gender   PC = Owns home computer
 

Parameter    Lambda value    Z-Score
{EHGPC}       0.019           0.367
{EHPC}        0.076           1.483
{EGPC}        0.066           1.286
{HGPC}        0.055           1.079
{EG}*        -0.063          -1.228
{HG}*        -0.060          -1.173
{HPC}*        0.006           0.110

*(necessitates nonhierarchical model)
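The screening in this table can be written as a simple filter. The sketch below is Python, with the lambda and Z values copied from the table; the cutoff of 1.96 is the conventional two-tailed .05 value and is an assumption for illustration, not a rule stated in the exercise. As noted elsewhere in this feedback, a small Z only nominates a term as a candidate; the reduced model still has to be tested.

```python
# (lambda, Z) pairs copied from the table above.
PARAMETERS = {
    "EHGPC": (0.019, 0.367),
    "EHPC": (0.076, 1.483),
    "EGPC": (0.066, 1.286),
    "HGPC": (0.055, 1.079),
    "EG": (-0.063, -1.228),
    "HG": (-0.060, -1.173),
    "HPC": (0.006, 0.110),
}


def drop_candidates(parameters, critical_z=1.96):
    """Terms whose |Z| falls below the cutoff, weakest first."""
    weak = [name for name, (_, z) in parameters.items() if abs(z) < critical_z]
    return sorted(weak, key=lambda name: abs(parameters[name][1]))


for name in drop_candidates(PARAMETERS):
    lam, z = PARAMETERS[name]
    print(f"{{{name}}}: lambda = {lam:+.3f}, Z = {z:+.3f}")
```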

7. (4 points) Use either Gilbert or class terminology to describe the model that you believe has the best fit.
Choose among the models in your output only. (REMEMBER THIS ONE! HOWEVER, YOU COULD TEST OTHER HIERARCHICAL MODELS AND OBTAIN THE CHI-SQUARES, DF AND P-LEVELS THROUGH FORMAL TESTING.)
How many degrees of freedom are in this model?
SHOW how you obtained the degrees of freedom.
What was the G2 for this model?
What was the p-level for the model you selected?
Briefly describe the rationale for your choice of this model and the results that support it.
PLEASE USE THE IN-CLASS CRITERIA FOR USING P-LEVELS TO SUPPORT A FINAL MODEL.

See question 4.
Testing really is the only way!
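One way to SHOW the degrees of freedom is to add up the degrees of freedom of every term left out of the model, where each term contributes the product of (categories - 1) across its variables. The Python sketch below assumes the model from question 4 that omits the four-way term and the HISPANIC*GENDER*HOMECOMP term; the names `LEVELS`, `term_df` and `model_df` are invented for this illustration.

```python
from math import prod

# Number of categories for each variable in this exercise.
LEVELS = {"E": 2, "H": 2, "G": 2, "PC": 2}


def term_df(term):
    """DF carried by one term: product of (categories - 1) over its variables."""
    return prod(LEVELS[var] - 1 for var in term)


def model_df(omitted_terms):
    """DF of a model = sum of the DF of all terms omitted from it."""
    return sum(term_df(term) for term in omitted_terms)


OMITTED = [("H", "G", "PC"), ("E", "H", "G", "PC")]
print("DF =", model_df(OMITTED))
```

The same answer comes from counting cells: 16 cells minus 14 fitted parameters (the grand mean plus 13 lambda terms) leaves 2 degrees of freedom, which matches the DF = 2 reported for that model.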

8. (2 points) Using your results, write out the loglinear equation WITH NUMBERS for the saturated model.
(NOTE: Hilog does not give the grand mean or θ effect. For this assignment, it is OK to either put the Greek letter theta as a place holder or simply to eliminate it.)

Use the lambdas from the "parameter estimates" section of  your output.

Be sure to label the variables in your equation. You can assign them the letters A, B,C and D as long as you provide the variable names that accompany each of the letters. You can also assign the variables descriptive letters, e.g., E, H, G or PC.
 

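$$G_{ijkl} = \theta - .892\,\mathrm{E} - 1.363\,\mathrm{H} - .156\,\mathrm{G} + .447\,\mathrm{PC} - .201\,\mathrm{EH} - .063\,\mathrm{EG} + .450\,\mathrm{EPC} - .060\,\mathrm{HG} + .006\,\mathrm{HPC} + .092\,\mathrm{GPC} - .089\,\mathrm{EHG} + .076\,\mathrm{EHPC} + .067\,\mathrm{EGPC} + .055\,\mathrm{HGPC} + .019\,\mathrm{EHGPC}$$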

9. (1 point) Using the symbols (i.e., θ and λ), write out the loglinear equation for the model that corresponds to the model you believe has the best fit.

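(Recall that the HILOG program only generates the loglinear equation for the saturated model, although it will generate the degrees of freedom and G2 for a wide variety of hierarchical loglinear models. Therefore you can't use the parameter numbers if you have dropped any terms at all, since these will change somewhat with each new model.)

$$G_{ijkl} = \theta + \lambda_i^{E} + \lambda_j^{H} + \lambda_k^{G} + \lambda_l^{PC} + \lambda_{ij}^{EH} + \lambda_{ik}^{EG} + \lambda_{il}^{EPC} + \lambda_{jk}^{HG} + \lambda_{jl}^{HPC} + \lambda_{kl}^{GPC} + \lambda_{ijk}^{EHG} + \lambda_{ijl}^{EHPC} + \lambda_{ikl}^{EGPC}$$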

10. (1 point) IN WORDS, briefly describe the results as implied by your best fitting model. This means talking about the associations and possible interactions among the variables, not presenting numeric loglinear results or symbols. Imagine that you are describing the results in a non-technical fashion to a colleague or at a conference.

Here's where it is helpful to go back to the percentage table and perhaps even calculate a few three-way percentages. Back check your results with the ones that are statistically significant in your output.

 · Men more often own a home PC (although there’s no sex difference in the re-estimated model)
· So do those with more education

(Both parameters are positive)

· Hispanics have less education than non-Hispanics (the parameter is negative)

· Non-Hispanic males are slightly more likely to have a college BA than non-Hispanic females
· There’s a larger educational difference in owning a home computer for Hispanics (a 41% difference) than for non-Hispanics (a 30% difference)
· Low education non-Hispanics more often own a PC than low education Hispanics
 
 
Susan Carol Losh
March 18 2009