THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 2 FEEDBACK (20 POINTS TOTAL)
BASICS: TESTING HIERARCHICAL MODELS AND WRITING EQUATIONS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
Please read over this material very carefully.
Remember that I do not answer questions about PERSONAL papers in class
(including break). We can speak after class or in an appointment. Thanks!
This exercise provided practice with testing
hierarchical models with the likelihood ratio G2 statistic and
the SPSS "Model Selection" or "HILOG" program, as well as basic GCF loglinear
equations. Again, we are starting out basic; each of the four variables
in this model had two categories. We used the 6192
valid WEIGHTED cases from the 1999, 2002 and 2006 NSF Surveys of Public
Understanding of Science and Technology "BigDigital" file which examines
personal computer and Internet access and use. Here are the variables:
GENDER: 1 = Male, 2 = Female
HISPANIC (Hispanic or Latino background): 1 = Yes, 2 = No
REDUC2: Coded 1 = the respondent has at least a BA degree or 2 = the respondent has a junior college degree or less
HOMECOMP: Coded 1 = the respondent has a home personal computer or 2 = s/he does not have a home computer
These codes are important. I chose--or
in some cases reversed--the category values because the HILOG program counts
the first category (whatever it is, including 0) as "High". Thus,
if you found a negative
coefficient between gender and education, this would mean that females
(coded 2) had more education (coded 1) than males (coded 1).
I have reproduced the table again below
with
the calculation of an additional--but quite telling--percentage:
BACCALAUREATE OR MORE
| HISPANIC BACKGROUND | YES | YES | YES | NO | NO | NO |
| GENDER | MALE | FEMALE | BOTH: n (%) | MALE | FEMALE | BOTH: n (%) |
| HAS HOME COMPUTER | 92.9% | 83.3% | 61 (87%) | 85.9% | 81.3% | 1199 (84%) |
| OTHER | 7.1% | 16.7% | 9 | 14.1% | 18.7% | 236 |
| TOTAL | 100.0% (n = 28) | 100.0% (n = 42) | 70 | 100.0% (n = 708) | 100.0% (n = 727) | 1435 |
JUNIOR COLLEGE OR LESS
| HISPANIC BACKGROUND | YES | YES | YES | NO | NO | NO |
| GENDER | MALE | FEMALE | BOTH: n (%) | MALE | FEMALE | BOTH: n (%) |
| HAS HOME COMPUTER | 49.5% | 43.2% | 192 (46%) | 52.8% | 53.9% | 2282 (53%) |
| OTHER | 50.5% | 56.8% | 224 | 47.2% | 46.1% | 1989 |
| TOTAL | 100.0% (n = 194) | 100.0% (n = 222) | 416 | 100.0% (n = 1874) | 100.0% (n = 2397) | 4271 |
Source: NSF Surveys of Public Understanding of Science and Technology, 1999, 2002 and 2006, Directors: Jon D. Miller and Linda Kimmel; Opinion Research Corporation/MACRO/NORC General Social Survey. available n = 6192 (weighted cases)
Why is the added % "telling"? It shows what the HISPANIC*EDUCATION*PC interaction looks like.
Before you do any more complicated runs, you need to look the 4-way table over and gain a sense of what it is saying:
For individuals with at least a baccalaureate, Hispanic background makes very
little difference in owning a PC (87% versus 84%). However, among those with less education,
Hispanics are 7 percentage points less likely than Non-Hispanics to own a PC (46% versus 53%).
Gender
only makes a difference for Hispanics with at least a BA. This would imply
a 4 variable interaction effect--except we all know that there was not
a four variable interaction in these data.
In any event, your answer to question 10 had to tackle these findings in some way and the percentage tables are probably the easiest for a novice reader to grasp. You can also run the three-way tables using the SPSS Crosstabs routine.
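As a rough illustration of how the added percentages can be checked by hand, the sketch below recomputes them from the weighted counts in the "BOTH" columns of the two tables above. The dictionary layout and the function name are illustrative choices, not part of the SPSS output.

```python
# Weighted counts read off the "BOTH" columns of the two tables above:
# (education, Hispanic background) -> (has home computer, other)
counts = {
    ("BA or more", "Hispanic"): (61, 9),
    ("BA or more", "Non-Hispanic"): (1199, 236),
    ("Junior college or less", "Hispanic"): (192, 224),
    ("Junior college or less", "Non-Hispanic"): (2282, 1989),
}

def percent_with_pc(has_pc, other):
    """Percentage of a group that has a home computer."""
    return 100.0 * has_pc / (has_pc + other)

pct = {group: percent_with_pc(*cells) for group, cells in counts.items()}
for (education, background), value in pct.items():
    print(f"{education:<24} {background:<13} {value:5.1f}%")

# Education gap in PC ownership within each background, in percentage points.
gap_hispanic = (pct[("BA or more", "Hispanic")]
                - pct[("Junior college or less", "Hispanic")])
gap_non_hispanic = (pct[("BA or more", "Non-Hispanic")]
                    - pct[("Junior college or less", "Non-Hispanic")])
print(f"Education gap, Hispanic:     {gap_hispanic:.0f} points")
print(f"Education gap, Non-Hispanic: {gap_non_hispanic:.0f} points")
```

The two gaps differ in size, which is one way of seeing the HISPANIC*EDUCATION*PC interaction in plain percentages.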
When you present papers at a conference or write an article, you need to tell your reader or listener what you found in words about correlations and interactions. (A lot of readers never read the tables.) It's not enough just to say: we have a three way interaction effect among Hispanic Background, degree level and owning a personal computer; we need to describe what this means as above.
We are at a good pace and we are staying relatively close together in understanding the material! I'll go over the assignment and "sticky wickets" below.
Make sure you include the variable labels
for each lambda coefficient. You need to do so, just as you do when
you write out a multiple regression equation. Make sure you include ALL
the λs. Omitting a λ or two was one of the most common errors (and misreading the output was
a closely related error!)
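For example, in our four variable model, the saturated GCF equation is as follows, where
E = degree level, H = Hispanic, G = gender and PC = Owns a PC:

$$
\begin{aligned}
G_{ijkl} = {} & \theta + \lambda_i^{E} + \lambda_j^{H} + \lambda_k^{G} + \lambda_l^{PC} \\
& + \lambda_{ij}^{EH} + \lambda_{ik}^{EG} + \lambda_{il}^{EPC} + \lambda_{jk}^{HG} + \lambda_{jl}^{HPC} + \lambda_{kl}^{GPC} \\
& + \lambda_{ijk}^{EHG} + \lambda_{ijl}^{EHPC} + \lambda_{ikl}^{EGPC} + \lambda_{jkl}^{HGPC} \\
& + \lambda_{ijkl}^{EHGPC}
\end{aligned}
$$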
or, in numbers (the question asked for the numbers in the saturated model):
Gijkl = - .892E - 1.363H - .156G + .447PC
        - .201EH - .063EG + .450EPC - .060HG
        + .006HPC + .092GPC - .089EHG
        + .076EHPC + .067EGPC + .055HGPC + .019EHGPC
Be sure you are comfortable with the coefficients. Remember how they are coded! For example, since females are the second code "2" and the program treats the first code as "high," the -.156 G coefficient means there are somewhat more females (coded 2) than males (coded 1). The positive GPC coefficient means there are somewhat more males (coded 1) who own a home computer (coded 1) and more females (coded 2) who don't own a computer (coded 2). In other words, "high" gender goes with "high" PC ownership and the coefficient is positive. The 3-way and up interactions are trickier to interpret and that's when you need those percentage tables. One example is the positive education-gender-PC coefficient (+.067) meaning that males with at least a baccalaureate more often own a PC than females with at most a junior college degree.
NOTE: You only need to go out to 3 decimal places. More than that makes your equations very difficult to read. The program will use the extended decimal places as needed.
NOTE: The label for each λ coefficient in the HILOG output is ABOVE the numeric coefficient (not below it).
COMMON NOVICE MISTAKES:
Forgetting to include all the lower order terms when writing out the loglinear equation for a hierarchical model. Do get used to writing out all the lower order terms embedded in a hierarchical model. It will make running programs other than HILOG much easier.
Failing to enter all the terms you planned for a non-hierarchical model.
This practice in writing out terms will help you when you run other programs later.
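To make the "embedded lower order terms" idea concrete, here is a minimal sketch that expands a hierarchical model's highest-order terms into every term they imply and counts the model's degrees of freedom from the terms left out. The function names and the tuple representation are assumptions made for illustration; HILOG does this bookkeeping internally.

```python
from itertools import combinations

def implied_terms(generating_class):
    """Every term (non-empty subset of variables) implied by a hierarchical model."""
    terms = set()
    for top in generating_class:
        for size in range(1, len(top) + 1):
            for subset in combinations(top, size):
                terms.add(subset)
    # Main effects first, then two-ways, and so on, for readability.
    return sorted(terms, key=lambda t: (len(t), t))

def residual_df(generating_class, levels):
    """Model df = summed df of the saturated-model terms that the model omits."""
    variables = tuple(levels)
    kept = {frozenset(t) for t in implied_terms(generating_class)}
    df = 0
    for size in range(1, len(variables) + 1):
        for term in combinations(variables, size):
            if frozenset(term) not in kept:
                term_df = 1
                for v in term:
                    term_df *= levels[v] - 1   # (categories - 1) per variable
                df += term_df
    return df

# All four variables in this exercise have two categories.
levels = {"E": 2, "H": 2, "G": 2, "PC": 2}
# Highest-order terms of the model that drops H*G*PC and the four-way term.
model = [("E", "H", "G"), ("E", "H", "PC"), ("E", "G", "PC")]

for term in implied_terms(model):
    print("".join(term))
print("df =", residual_df(model, levels))
```

With two categories per variable every term carries one degree of freedom, so the model's df is simply the number of omitted terms.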
NOTE:
p is virtually never .0000! SPSS and other programs truncate the
level at a predetermined decimal point. When you see p = .000 this really
means p < .001. p "=" .0000 means p < .0001.
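A small sketch of this reporting rule follows; the function name `format_p` and its `decimals` argument are hypothetical and not something SPSS provides.

```python
def format_p(p, decimals=3):
    """Report a p-value without ever printing it as exactly zero."""
    floor = 10 ** -decimals  # smallest value the display can show
    if p < floor:
        text = f"p < {floor:.{decimals}f}"
    else:
        text = f"p = {p:.{decimals}f}"
    # Drop the leading zero, as in the handout's style (.591 rather than 0.591).
    return text.replace("0.", ".", 1)

print(format_p(0.591))        # an ordinary p-value
print(format_p(0.00004))      # displayed by the program as .000
print(format_p(0.00004, 4))   # displayed by the program as .0000
```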
INTERPRET THE K-WAY TABLE CORRECTLY: In the first k-way table, you are told the G2 and probability levels for k and up associations and interactions being deleted from the model. For example, the line reading k=4 is the test for OMITTING the 4-way interaction term. The remaining model incorporates all 3-way interactions and lower-order terms.
If the Chi-square for the k=4 line is large and the probability level is small, the model does not fit. You must return the 4-way interaction to the model in order for it to fit the observed data correctly.
k =3 means that you deleted all the 3 way plus the 4 way interaction effects. If the Chi-square for the k = 3 line is large and the probability level is small, a model that omits all 3-ways and the 4-way interaction does not fit. If the p-level on the k=4 line is large, you can omit the 4-way interaction but must return at least one 3-way interaction to the model for it to fit. You may not need to return all the 3-way interactions to make the model fit (in this example, you could omit the Hispanic*Gender*Home Computer 3-way interaction.)
LARGE chi-squares
mean that you omitted a vital lambda term that must be returned to the
model.
SMALL chi-squares
typically mean the model fits. Perhaps you can omit other terms as well.
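To see how a G2 and its degrees of freedom translate into the probability level printed in the k-way table, the sketch below computes the upper-tail chi-square probability using only the Python standard library (closed-form finite series for integer df). The function name `chi2_sf` is an illustrative choice, and the three G2 values are the ones reported later in this feedback.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom >= x)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    p = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    for r in range(1, (df - 1) // 2 + 1):
        p += term
        term *= x / (2 * r + 1)
    return p

tests = [
    ("k = 4: omit the 4-way term", 0.289, 1),
    ("k = 3: omit all 3-ways and the 4-way", 14.67, 5),
    ("omit the 4-way and H*G*PC", 2.84, 2),
]
for label, g2, df in tests:
    print(f"{label:<40} G2 = {g2:6.3f}  df = {df}  p = {chi2_sf(g2, df):.3f}")
```

A small G2 relative to its df gives a large p (the model fits); a large G2 gives a small p (a needed term was omitted).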
USE THE TABLE ABOVE TO HELP ANSWER THE
FOLLOWING QUESTIONS.
PLEASE BE SURE TO ANSWER ALL DESIGNATED
PARTS OF A QUESTION.
1. Your SPSS FREQUENCIES output and your MODEL SELECTION ("HILOG") output (2 points)
Although your output does not have a large weight, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output.
PLUS YOUR ANSWERS TO QUESTIONS 2 - 10 BELOW:
2. (2 points) What is the G2, degrees of freedom and p-level for the hierarchical model that incorporates all the three-way interaction effects?
This is the model which omits ONLY the four-way interaction term. All three-way effects (and lower order terms) are present. You can locate this set of results from the k = 4 line of the first k-way table and these are:
Likelihood ratio chi square = 0.289 DF = 1 P = .591
3. (2 points) Do your results suggest that
the four variable interaction term can be deleted from a well-fitting model?
BRIEFLY, give the rationale behind your
decision.
Yes, the four-way interaction term can be deleted. The G2 is quite small (.289) and the probability of this model exceeds 0.20. The expected values from this all 3-way model are very close to the values in the observed four-way table.
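The link between "expected values close to observed values" and a small G2 can be sketched directly from the definition G2 = 2 Σ O ln(O/E). The cell counts below are made-up numbers for a small table, not values from the NSF data, and the function name is an assumption for illustration.

```python
import math

def likelihood_ratio_g2(observed, expected):
    """G2 = 2 * sum(O * ln(O / E)); empty observed cells contribute zero."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

# Hypothetical 2 x 2 table, cells listed row by row.
observed = [30, 10, 10, 30]

# Expected counts from a model that reproduces the table almost exactly ...
close_fit = [29.5, 10.5, 10.5, 29.5]
# ... and from a model of no association (every cell equal to 20).
poor_fit = [20, 20, 20, 20]

print(f"G2, close fit: {likelihood_ratio_g2(observed, close_fit):.3f}")
print(f"G2, poor fit:  {likelihood_ratio_g2(observed, poor_fit):.3f}")
```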
4. (2 points) Based on the results, does
it appear that any of the three-way interaction terms must be retained
in order for the model to fit well?
If so, which interaction term(s) must
be retained?
BRIEFLY give the rationale behind your
decision.
The k = 3 line indicates that a model that deletes all 3-way effects and the 4-way interaction has a very large chi-square and a low probability level (G2 = 14.67 DF = 5 P = .012). This means that AT LEAST one of the 3-way interactions must be retained in order for the observed and expected values in the 4-way table to match within sampling error. A careful look at the partial associations suggests that one possible candidate to drop would be the HISPANIC*GENDER*HOMECOMP interaction. In the partial associations table, this would generate a G2 = 2.56 DF = 1 P = .1099. However, this probability level is below our typical cutoff value of 0.20, indicating that a model omitting this three-way interaction could generate expected values that may be different from the observed results in the table.
There's only one way to find out! Go ahead and test that model. In fact I ran a few simpler models to try out (see the link below). Some of them flopped miserably although the initial diagnostic statistics suggested that they might work.
Finally I went ahead and ran a model deleting the 4-way interaction and the HISPANIC*GENDER*HOMECOMP 3-way interaction. That generated a G2 = 2.84 DF = 2 P = .241, which fits the data quite nicely.
It is easy to go ahead and test these models sequentially. Follow my testing in more detail HERE.
The G2 associated with the k-way table essentially represents a type of "average"
chi-square per degrees of freedom. Therefore,
it is not necessarily the most accurate way to make decisions on any particular
interaction effect or partial correlation. In conjunction with the partial
associations and the λ parameter
standardized Z-scores, however, the k-way table can be very helpful in
making decisions about a final model.
Examining the λs brings some other insights. For example,
they suggest we might want to re-estimate using a non-hierarchical model.
There isn't much of a two-way association between H*G (z = -1.173) or H*PC
(z = 0.110).
However, we need to consider all the information together,
and at least sometimes, in light of the study design. HILOG runs hierarchical
models, so we can't test dropping some of these coefficients. Gender and
ethnicity were weighted to reflect the U.S. adult population. So the low GENDER
* HISPANIC λ value is
not surprising. A non-hierarchical model dropping this term could probably
work.
5. (2 points) Based on the results, does
it appear that any of the two-way association terms can be dropped from
the model and yet the model will still fit the data well?
If so, which two way association(s) could
be dropped?
BRIEFLY give the rationale behind your
decision.
Using the lambdas and Z-scores, the three possible candidates are: REDUC2*GENDER (Z = -1.23), HISPANIC*GENDER (Z =-1.17), and HISPANIC*HOMECOMP (Z = 0.11).
Using the partial association table:
Dropping the Education by Gender correlation results in a G2 = 8.63 DF = 1 P = .003
Dropping the Hispanic by Gender correlation results in a G2 = 0.16 DF = 1 P = .694
Dropping the Hispanic by Home PC correlation results in a G2 = 6.22 DF = 1 P = .013
The only consistent one that might be dropped is the Hispanic * Gender two-way (see above).
What about the others?
Why the discrepancies? The E * G and H * PC are both buried in significant
3-way interactions. Just as dropping a lower order term in ANOVA can produce
some strange results when an associated higher order interaction is statistically
significant, we need to be careful shifting to a non-hierarchical model.
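As a quick screen, the partial association results above can be compared against the in-class 0.20 cutoff. This is only a sketch of the decision rule (the list layout and names are illustrative), and a flagged term is a candidate for testing, not a guarantee that the reduced model fits.

```python
CUTOFF = 0.20  # in-class criterion: p above this suggests the term may be dropped

# (term, G2 for dropping it, df, p) as reported in the partial association table
partials = [
    ("Education * Gender", 8.63, 1, 0.003),
    ("Hispanic * Gender", 0.16, 1, 0.694),
    ("Hispanic * Home PC", 6.22, 1, 0.013),
]

droppable = [term for term, g2, df, p in partials if p > CUTOFF]

for term, g2, df, p in partials:
    verdict = "candidate to drop" if p > CUTOFF else "keep"
    print(f"{term:<20} G2 = {g2:5.2f}  df = {df}  p = {p:.3f}  -> {verdict}")
```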
6. (2 points) Using the Z-values
associated with each parameter, which parameters look like they may be
dropped and yet the model will still fit well? (A bit redundant at this
point but here goes:)
E = Education
H = Hispanic G = Gender PC = Owns home computer
| {EHGPC} |
| {EHPC} |
| {EGPC} |
| {HGPC} |
| {EG}* |
| {HG}* |
| {HPC}* |
*(necessitates nonhierarchical model)
7. (4 points) Use either Gilbert or class
terminology to describe the model that you believe has the best fit.
Choose among the models in your output
only. (REMEMBER THIS ONE! HOWEVER, YOU COULD TEST OTHER HIERARCHICAL MODELS
AND OBTAIN THE CHI-SQUARES, DF AND P-LEVELS THROUGH FORMAL TESTING.)
How many degrees of freedom are in this
model?
SHOW how you obtained the degrees of
freedom.
What was the G2 for this model?
What was the p-level for the model you
selected?
Briefly describe the rationale for your
choice of this model and the results that support it.
PLEASE USE THE IN-CLASS CRITERIA FOR USING
P-LEVELS TO SUPPORT A FINAL MODEL.
See question 4.
Testing really
is the only way!
8. (2 points) Using your results, write
out the loglinear equation WITH NUMBERS for the saturated model.
(NOTE: Hilog does not give the grand mean or θ effect. For this
assignment, it is OK to either put the Greek letter theta as a place holder
or simply to eliminate it.)
Use the lambdas from the "parameter estimates" section of your output.
Be sure to label the variables in your equation. You
can assign them the letters A, B, C and D as long as you provide the variable
names that accompany each of the letters. You can also assign the variables
descriptive letters, e.g., E, H, G or PC.
Gijkl = - .892E - 1.363H - .156G + .447PC
        - .201EH - .063EG + .450EPC - .060HG
        + .006HPC + .092GPC - .089EHG
        + .076EHPC + .067EGPC + .055HGPC + .019EHGPC
9. (1 point) Using the symbols (i.e., θ and λ), write out the loglinear
equation for the model that corresponds to the model you believe has the
best fit.
(Recall that the HILOG program only generates the loglinear equation for the saturated model, although it will generate the degrees of freedom and G2 for a wide variety of hierarchical loglinear models. Therefore you can't use the parameter numbers if you have dropped any terms at all, since these will change somewhat with each new model.)

$$
\begin{aligned}
G_{ijkl} = {} & \theta + \lambda_i^{E} + \lambda_j^{H} + \lambda_k^{G} + \lambda_l^{PC} \\
& + \lambda_{ij}^{EH} + \lambda_{ik}^{EG} + \lambda_{il}^{EPC} + \lambda_{jk}^{HG} + \lambda_{jl}^{HPC} + \lambda_{kl}^{GPC} \\
& + \lambda_{ijk}^{EHG} + \lambda_{ijl}^{EHPC} + \lambda_{ikl}^{EGPC}
\end{aligned}
$$
10. (1 point) IN WORDS, briefly describe the results as implied by your best fitting model. This means talking about the associations and possible interactions among the variables, not presenting numeric loglinear results or symbols. Imagine that you are describing the results in a non-technical fashion to a colleague or at a conference.
Here's where it is helpful to go back to the percentage table and perhaps even calculate a few three-way percentages. Back check your results with the ones that are statistically significant in your output.
· Men more often own a home PC (although there's no sex difference in the re-estimated model)
· So do those with more education (Both parameters are positive)
· Hispanics have less education than non-Hispanics (the parameter is negative)
· Non-Hispanic males are slightly more likely to have a college BA than non-Hispanic females
· There's a larger educational difference in owning a home computer for Hispanics (a 41% difference) than for non-Hispanics (a 30% difference)
· Low education non-Hispanics more often own a PC than low education Hispanics
Susan Carol Losh
March 18 2009