THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 2
THE MODEL SELECTION (HILOG) LOGLINEAR SPSS PROGRAM
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University
PLEASE NOTE: Be sure to examine the program variables for your model very carefully. I have created some new recodes for this (and possible future) assignments. The Model Selection program treats the first category in arithmetic sequence of each variable as "high" and the second as "low". So, be sure to use: REDUC2, HISPANIC, GENDER and HOMECOMP for this assignment.
Questions? Send me an email at: slosh@fsu.edu

PLEASE BE CERTAIN TO USE A "FULL-SIZE" SPSS VERSION (e.g., 15 or 6.1.3). Student-version ("studentware") SPSS programs almost certainly will not work with a database this size. Make sure the Advanced Statistics option in SPSS is installed so that you can use the loglinear programs.


The purpose of this exercise is to provide preliminary experience with testing and evaluating basic hierarchical loglinear models. You will analyze four variables using the 6192 valid WEIGHTED cases (6187 unweighted cases) from the 1999, 2002 and 2006 NSF Surveys of Public Understanding of Science and Technology "BigDigital" file, which examines personal computer and Internet access and use. The data are provided for you on the CD. You will use the "Model Selection" program in SPSS. The four variables in your model are:
GENDER: 1 = Male, 2 = Female
HISPANIC (Hispanic or Latino background): 1 = Yes, 2 = No
REDUC2: 1 = the respondent has at least a BA degree, 2 = the respondent has a junior college degree or less
HOMECOMP: 1 = the respondent has a home personal computer, 2 = s/he does not have a home computer
This 2 x 2 x 2 x 2 percentage table, provided for your information, looks as follows. The MALE and FEMALE columns show column percentages; the TOTAL columns and the (n) rows show weighted case counts.

BACCALAUREATE OR MORE
HISPANIC BACKGROUND                YES                         NO
GENDER                   MALE   FEMALE    TOTAL     MALE   FEMALE    TOTAL
HAS HOME COMPUTER       92.9%    83.3%       61    85.9%    81.3%     1199
OTHER                    7.1%    16.7%        9    14.1%    18.7%      236
TOTAL                  100.0%   100.0%            100.0%   100.0%
(n)                        28       42       70      708      727     1435

JUNIOR COLLEGE OR LESS
HISPANIC BACKGROUND                YES                         NO
GENDER                   MALE   FEMALE    TOTAL     MALE   FEMALE    TOTAL
HAS HOME COMPUTER       49.5%    43.2%      192    52.8%    53.9%     2282
OTHER                   50.5%    56.8%      224    47.2%    46.1%     1989
TOTAL                  100.0%   100.0%            100.0%   100.0%
(n)                       194      222      416     1874     2397     4271

Source: NSF Surveys of Public Understanding of Science and Technology, 1999, 2002 and 2006, Directors: Jon D. Miller and Linda Kimmel; Opinion Research Corporation/MACRO/NORC General Social Survey. Available n = 6192 (weighted cases).
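Loglinear programs work from cell counts, not percentages, so it can help to see how the percentage table maps back onto the 16 cells. The short Python sketch below is only an illustration, not part of the SPSS assignment: it rebuilds approximate cell counts from the column percentages and the weighted column n's in the table. The dictionary layout and the function name approximate_counts are my own choices, and because the published figures are rounded and weighted, the rebuilt counts may differ from the SPSS counts by a case or so.

```python
# "Has home computer" column percentages and weighted column n's from the
# percentage table, keyed by (education, Hispanic background, gender).
table = {
    ("BA or more", "Hispanic", "Male"): (92.9, 28),
    ("BA or more", "Hispanic", "Female"): (83.3, 42),
    ("BA or more", "Not Hispanic", "Male"): (85.9, 708),
    ("BA or more", "Not Hispanic", "Female"): (81.3, 727),
    ("JC or less", "Hispanic", "Male"): (49.5, 194),
    ("JC or less", "Hispanic", "Female"): (43.2, 222),
    ("JC or less", "Not Hispanic", "Male"): (52.8, 1874),
    ("JC or less", "Not Hispanic", "Female"): (53.9, 2397),
}


def approximate_counts(pct_table):
    """Turn (percent with a home computer, column n) into two cell counts."""
    counts = {}
    for key, (pct_has, n) in pct_table.items():
        has = round(pct_has * n / 100)          # approximate, due to rounding
        counts[key + ("Has home computer",)] = has
        counts[key + ("Other",)] = n - has      # remainder of the column
    return counts


counts = approximate_counts(table)
total = sum(counts.values())

for cell, count in sorted(counts.items()):
    print(cell, count)
print("Total weighted cases:", total)
```

The 16 rebuilt counts add up to the available weighted n reported in the source note.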
Some researchers find that USA Hispanics are less likely to own a PC (and therefore go on the Internet less) than USA Anglos. Is this difference really a function of education? (Or just sampling error?) Do women and men differ? Do any of these interact with an admittedly crude education measure? Analysis will tell! 

The 1999, 2002 and 2006 samples of the NSF Surveys of Public Understanding of Science and Technology were telephone random digit dial surveys with random selection within households.* These are probability samples of the lower 48 United States.
*(To be compatible with the earlier data, the 2006 data use the cell and landline subsample [~95%] of the face to face surveys.)
This exercise uses the "Model Selection" program in SPSS. This "HILOG" program (short for HIerarchical LOGlinear) estimates ONLY hierarchical loglinear models. With the Model Selection or "HILOG" program, you only specify the highest-order terms and the program will fill in the lower-order terms that they imply.
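To see what "filling in the rest" means, here is a minimal Python sketch of the hierarchy principle. It is an illustration only, not what SPSS runs internally: the function name implied_terms and the generic variable letters A, B, C and D are assumptions made for the example. Given the highest-order terms, it lists every lower-order term that a hierarchical model must also contain.

```python
from itertools import combinations


def implied_terms(highest_order_terms):
    """List every term a hierarchical model contains, given its top terms."""
    terms = set()
    for top in highest_order_terms:
        variables = sorted(top)
        # Every non-empty subset of a top term is also in the model.
        for size in range(1, len(variables) + 1):
            for combo in combinations(variables, size):
                terms.add(combo)
    # Order by term size, then alphabetically, for readable output.
    return sorted(terms, key=lambda t: (len(t), t))


# Hypothetical model with one three-way interaction and one extra association.
example = implied_terms([("A", "B", "C"), ("C", "D")])
for term in example:
    print("*".join(term))
```

For this hypothetical model the list includes the four marginal effects, the associations A*B, A*C, B*C and C*D, and the A*B*C interaction, but not A*D or B*D, because no specified term contains those pairs.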
The HILOG program will only provide the numeric lambda parameters of the general cell frequency equation for the fully saturated model. It also does not estimate the grand mean (equiprobable model) term. Whichever of the loglinear, ordinal, or logistic regression packages you use, now or later, you really need to read the output VERY carefully to ensure that you estimated the model that you thought you did and to understand the output terms.
In this exercise you will evaluate a four-variable model. You will interpret statistics from the saturated model. Then you will examine the various backwards elimination models to estimate the best model to use with the data. There will be some divergence between what the computer program will tell you is the best fitting model under the default conditions and some of our class material. Be sure to go with the in-class material in this case! The program uses a "data dredger" mechanical means to make its decision, and we hope that human judgement will take more of the nuances into account. For example, we use a 0.20 probability cutoff and the program uses a 0.05 cutoff for its default. (NOTE: On the front of the Model Selection menu, change the "probability for removal" from 0.05 to 0.20!) Under the 0.05 default, some models that may "underfit" when reproducing the observed table could "slide through" and meet the SPSS criteria for a good model because they have probability levels of over 0.05 but less than 0.20.
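The difference between the two cutoffs is easy to see with a small calculation. The Python sketch below is an illustration only: the function chi2_p_value is my own helper (it uses the closed-form chi-squared tail for whole-number degrees of freedom), and the G^{2} of 8.0 on 5 degrees of freedom is a made-up value, NOT a result from this data set.

```python
import math


def chi2_p_value(g2, df):
    """Upper-tail chi-squared probability for a positive whole-number df."""
    if g2 <= 0:
        return 1.0
    half = g2 / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        p, nu = math.exp(-half), 2
    else:
        p, nu = math.erfc(math.sqrt(half)), 1
    # ... then step up two degrees of freedom at a time.
    while nu < df:
        p += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return p


# Hypothetical model: G-squared = 8.0 with 5 degrees of freedom.
g2, df = 8.0, 5
p = chi2_p_value(g2, df)
for cutoff in (0.05, 0.20):
    verdict = "fits well enough" if p > cutoff else "underfits"
    print(f"cutoff {cutoff:.2f}: p = {p:.3f}, model {verdict}")
```

This hypothetical model has a probability level between the two cutoffs, so it would "slide through" under the SPSS default but would be rejected under the in-class 0.20 criterion.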
You will examine the likelihood ratio chi-squared statistics (G^{2}), the degrees of freedom, the probability levels and the Z-values, which are the standardized scores that correspond to the numeric lambda estimates for the saturated model. All of these results should give you some clues about the best model for the data. In the process you will evaluate several different models for the data in the questions at the end of this page.
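As a reminder of what G^{2} and the degrees of freedom measure, here is a minimal Python sketch for the simplest case, an independence model in a 2 x 2 table. The counts are hypothetical (they are not from the BigDigital file) and the function name g_squared is my own; SPSS does the equivalent work for you on the full four-way table.

```python
import math


def g_squared(observed, expected):
    """Likelihood ratio chi-squared: 2 * sum of obs * ln(obs / exp)."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


# Hypothetical 2 x 2 table of observed counts.
obs = [[30, 20],
       [20, 30]]
n = sum(sum(row) for row in obs)
row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]

# Expected counts under independence: row total * column total / n.
exp = [[r * c / n for c in col_totals] for r in row_totals]

flat_obs = [o for row in obs for o in row]
flat_exp = [e for row in exp for e in row]
g2 = g_squared(flat_obs, flat_exp)

# Degrees of freedom = cells minus estimated parameters
# (grand mean, one row effect, one column effect).
df = len(flat_obs) - (1 + 1 + 1)
print(f"G-squared = {g2:.4f} with {df} degree(s) of freedom")
```

The larger the G^{2} relative to its degrees of freedom, the worse the model reproduces the observed table.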
You will also write out the numeric general cell frequency equation using the lambdas for the saturated loglinear model that describes these four variables. Then, you will interpret the findings in this 4-way table in words. The percentage tables above should help you with the verbal descriptions.
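For orientation, one common way to write the symbolic form of the saturated four-variable equation is sketched below in LaTeX. The notation is an assumption on my part: F for the expected cell frequency, theta for the grand mean, and the letters G (gender), E (education), H (Hispanic background) and PC (home computer) for the variables. If the class notation or the Exercise 1 Feedback differs, follow those instead.

```latex
\begin{aligned}
\ln F_{ijkl} = {} & \theta
  + \lambda^{G}_{i} + \lambda^{E}_{j} + \lambda^{H}_{k} + \lambda^{PC}_{l} \\
& + \lambda^{GE}_{ij} + \lambda^{GH}_{ik} + \lambda^{G\,PC}_{il}
  + \lambda^{EH}_{jk} + \lambda^{E\,PC}_{jl} + \lambda^{H\,PC}_{kl} \\
& + \lambda^{GEH}_{ijk} + \lambda^{GE\,PC}_{ijl}
  + \lambda^{GH\,PC}_{ikl} + \lambda^{EH\,PC}_{jkl} \\
& + \lambda^{GEH\,PC}_{ijkl}
\end{aligned}
```

The saturated model has four marginal effects, six two-way associations, four three-way interactions and one four-way interaction, in addition to the grand mean.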

Open the SPSS program.
In the initial window that opens in the middle of the page, click "OK".
When the "Open File" window opens in the middle of the page, navigate to the CD-ROM drive.
Then double-click in the CD-ROM window to load the BigDigitalC8306.sav file on your CD into SPSS.
This is a big file, so give SPSS a minute to load all the data.
Under the Analyze menu at the top of the page, select Descriptive Statistics, then select Frequencies.
Select the variables: GENDER, HISPANIC, HOMECOMP and REDUC2.
Click OK to run the frequencies for these four variables.
Make sure that all the missing data are, in fact, coded as missing (including any system-missing "sysmis" values) and that the percentages shown are ONLY for valid cases.
There are a total of 6192 valid WEIGHTED cases (6187 unweighted cases). These are available for the years 1999, 2002 and 2006. Glance at the bottom of the SPSS data viewer screen to make sure that it says "weight on".
(Ethnicity is only available from 1999 on; including HISPANIC in the analysis is what selects the three data years we are analyzing.)
Now, print the Frequencies tables for your output to turn in to me.
Under the Analyze menu at the top of the page, select Loglinear, then select Model Selection...
For factors, select REDUC2 (range 1-2), HISPANIC (range 1-2), GENDER (range 1-2), and HOMECOMP (range 1-2). You must click on the Define Range box each time you enter a variable, and enter the minimum and maximum values for each variable.
Click on Model.
You will start with the saturated model already chosen for you, so click Continue.
Click on the Options... button.
Left-click to select Parameter estimates and Association table.
Click Continue.
Remember to change the probability for removal on the front menu to 0.20.
Now click OK to run the program.

Well, running the computer program is one thing, but being able to view ALL of your output is something else again. If you use either SPSS 15 or later, or SPSS 6.1.3, all your output should be present and easily visible. Apart from 6.1.3, I do not recommend using versions of SPSS earlier than 15, which can present complications in viewing the output.
Be sure to print your loglinear "Model Selection" program output to turn in to me with your assignment.
The Model Selection program gives you a variety of output.
"Tests that K-way and higher order effects are zero" is a BACKWARDS elimination algorithm. (HILOG does not do a forward selection algorithm.) This is a set of HIERARCHICAL TESTS.
The "Tests that K-way effects are zero" is NOT a forward algorithm. It is actually a nonhierarchical set of tests, e.g., it eliminates all the marginal effects but preserves the two-way associations and the interaction terms. This is kind of a strange option and one which may not be very useful to you (unless you have, perhaps, equal cell size experiments or other nonhierarchical designs).
With the "Tests of PARTIAL associations", HILOG eliminates one term at a time and assesses the result. This section is not hierarchical either, but both this and some aspects of the nonhierarchical "K-way effects" section are very helpful in deciding which, if any, effects can be eliminated from your model.
The "Estimates for Parameters" section is also very helpful. Not only will it give you the loglinear parameters you need to write the numeric equation for the saturated model (ONLY) (see the example in the Exercise 1 Feedback), but the standardized Z-values can assist you in selecting the most parsimonious model that can still explain the data. Pay attention to Z-values with an absolute value of 1.50 or larger. These parameters generally cannot be dropped.
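The 1.50 rule of thumb can be sketched in a few lines of Python. This is an illustration only: each Z-value is the lambda estimate divided by its standard error, and the term names, estimates and standard errors below are made up for the example, NOT taken from your output.

```python
CUTOFF = 1.50  # in-class rule of thumb for the absolute Z-value


def z_value(estimate, std_error):
    """Standardized score: the lambda estimate divided by its standard error."""
    return estimate / std_error


# Hypothetical (lambda, standard error) pairs for three terms.
hypothetical = {
    "A*B": (0.210, 0.060),
    "A*C": (0.030, 0.055),
    "A*B*C": (-0.095, 0.058),
}

retain = []
for term, (lam, se) in hypothetical.items():
    z = z_value(lam, se)
    if abs(z) >= CUTOFF:
        retain.append(term)
        note = "generally cannot be dropped"
    else:
        note = "candidate for dropping"
    print(f"{term}: Z = {z:.2f} ({note})")
```

Note that the sign of Z does not matter for this screening; only its absolute value does. Remember too that in a hierarchical model a lower-order term must stay if a higher-order term containing it stays.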

1. Your SPSS FREQUENCIES output and your MODEL SELECTION ("HILOG") output
(2 points) Although your output does not carry a large weight for the exercise, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these against your output to give you more credit for understanding the material.
PLUS YOUR ANSWERS TO QUESTIONS 2-10 BELOW:
2. (2 points) What are the G^{2}, degrees of freedom and p-level for the hierarchical model that incorporates all the three-way interaction effects?
3. (2 points) Do your results suggest that the four-variable interaction term can be deleted from a well-fitting model?
BRIEFLY, give the rationale behind your decision of why or why not.
4. (2 points) Based on the results, does it appear that any of the three-way interaction terms must be retained in order for the model to fit well?
If so, which interaction term(s) must be retained? (List all that apply.)
BRIEFLY give the rationale behind your decision.
5. (2 points) Based on the results, does it appear that any of the two-way association terms can be dropped from the model and yet the model will still fit the data well?
If so, which two-way association(s) could be dropped? (List all that apply.)
BRIEFLY give the rationale behind your decision.
6. (2 points) Using the Z-score values associated with each parameter, which parameters look like they may be dropped and yet the model will still fit well? (List all that apply.)
7. (4 points) Use either text or class terminology to describe the model that you believe has the best fit.
Choose among the models in your output only.
How many degrees of freedom are in this model?
SHOW how you obtained the degrees of freedom.
What was the G^{2} for this model?
What was the p-level for the model you selected?
Briefly describe the rationale for your choice of this model and the results that support it.
PLEASE USE THE IN-CLASS CRITERION (p > .20) FOR USING P-LEVELS TO SUPPORT A FINAL MODEL.
8. (2 points) Using your results, write out the loglinear equation WITH NUMBERS for the saturated model.
(NOTE: HILOG does not give the grand mean effect. For this assignment, it is OK either to put the Greek letter theta as a placeholder or simply to eliminate it. Review the Exercise 1 Feedback again if needed for an example of a loglinear equation with numbers.)
Use the lambdas from the "parameter estimates" section of your output.
Be sure to label the variables in your equation. You can assign them the letters A, B, C and D as long as you provide the variable names that accompany each of the letters. You can also assign the variables descriptive letters, e.g., G, E, H or PC.
9. (1 point) Using symbols rather than numbers (i.e., the lambda terms), write out the loglinear equation for the model that corresponds to the model you believe has the best fit.
(Recall that the HILOG program only generates the loglinear equation for the saturated model, although it will generate the degrees of freedom and G^{2} for a wide variety of hierarchical loglinear models. Therefore you can't use the parameter numbers if you have dropped any terms at all, since these will change somewhat with each new model. However, you CAN use the correct lambdas with superscripts and/or subscripts.)
10. (1 point) IN WORDS, briefly describe the results as implied by your best fitting model. This means talking about the associations and possible interactions among the variables in words, not presenting numeric loglinear results or symbols. Imagine that you are describing the results in a nontechnical fashion to a colleague or at a conference.

Susan Carol Losh
February 25 2009