
 
 
EDF 6937-01       SPRING 2009
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
EXERCISE 2
THE MODEL SELECTION (HILOG) LOGLINEAR SPSS PROGRAM
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

PLEASE NOTE: Be sure to examine the program variables for your model very carefully. I have created some new recodes for this (and possible future) assignments. The Model Selection program treats the first category in arithmetic sequence of each variable as "high" and the second as "low". So, be sure to use: REDUC2, HISPANIC, GENDER and HOMECOMP for this assignment.

Questions? Send me an email at: slosh@fsu.edu

And, REMEMBER! (Repeat aloud as needed:)
 
 
Big Chi-squares are BAD.
Little Chi-squares are GOOD.

 
YOUR VARIABLES
SPECIFICATIONS & PROGRAMS
THE SPSS HILOG PROGRAM
EXAMINING YOUR OUTPUT
ASSIGNMENT QUESTIONS

PLEASE BE CERTAIN TO USE A "FULLSIZE" SPSS VERSION (e.g., 15 or 6.1.3). Studentware type SPSS programs almost certainly will not work with a database this size. Make sure the advanced statistics option in SPSS is installed so that you can use the loglinear programs.
 
 

 
THIS EXERCISE IS DUE THURSDAY MARCH 5 AT CLASS. WE WILL DISCUSS IT BEFORE YOU TURN IN YOUR EXERCISE.

If you are unable to attend class March 5, please see that I receive your exercise by TWO P.M. Thursday March 5.

Please remember THAT I WOULD LIKE HARD COPY!
BE SURE TO INCLUDE HARD COPY OF YOUR COMPUTER OUTPUT!

If necessary you may email me your answers as text in an email. That is OK.
However, you will need to turn in your SPSS output separately either in my Stone mailbox or through a friend.
Or fax to 644-8776. (Attention: Dr. Susan Carol Losh)
Or place in my EPLS mailbox (that's in room 3210 Stone).
Or mail to: Dr. Susan Carol Losh, Educational Psychology and Learning Systems, Florida State University, Tallahassee FL 32306-4453.
 

PURPOSE, VARIABLES AND THE FOUR WAY PERCENTAGE TABLE

The purpose of this exercise is to provide preliminary experience with testing and evaluating basic hierarchical loglinear models. You will analyze four variables using the 6192 valid WEIGHTED cases from the 1999, 2002 and 2006 NSF Surveys of Public Understanding of Science and Technology "BigDigital" file, which examines personal computer and Internet access and use. The data are provided for you on the CD. You will use the "Model Selection" program in SPSS. The four variables in your model are:

GENDER: 1 = Male  2 = Female

HISPANIC (or Latino background):  1= Yes  2 = No

REDUC2: Coded 1 = the respondent has at least a BA degree or 2 = the respondent has a junior college degree or less

HOMECOMP: Coded 1 = the respondent has a home personal computer or 2=s/he does not have a home computer

This 2 X 2 X 2 X 2 percentage table for your viewing information looks as follows:

BACCALAUREATE OR MORE

HISPANIC BACKGROUND     YES                             NO
GENDER                  MALE      FEMALE    ROW N       MALE      FEMALE    ROW N
HAS HOME COMPUTER       92.9%     83.3%     61          85.9%     81.3%     1199
OTHER                    7.1      16.7       9          14.1      18.7       236
TOTAL                  100.0%    100.0%                100.0%    100.0%
N                         28        42      70           708       727      1435

JUNIOR COLLEGE OR LESS

HISPANIC BACKGROUND     YES                             NO
GENDER                  MALE      FEMALE    ROW N       MALE      FEMALE    ROW N
HAS HOME COMPUTER       49.5%     43.2%     192         52.8%     53.9%     2282
OTHER                   50.5      56.8      224         47.2      46.1      1989
TOTAL                  100.0%    100.0%                100.0%    100.0%
N                        194       222      416         1874      2397      4271

Source: NSF Surveys of Public Understanding of Science and Technology, 1999, 2002 and 2006, Directors: Jon D. Miller and Linda Kimmel; Opinion Research Corporation/MACRO/NORC General Social Survey. available n = 6192 (weighted cases)
 
 

Some researchers find that USA Hispanics are less likely to own a PC (and therefore go on the Internet less) than USA Anglos. Is this difference really a function of education? (Or just sampling error?) Do women and men differ? Do any of these interact with an admittedly crude education measure? Analysis will tell!

Take a few minutes to study these percentage results. The results from this percentage table can be particularly useful in answering question 10 below.
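To make the percentage table easier to work with, here is a minimal Python sketch (not part of the SPSS assignment) that rebuilds approximate cell counts from the column percentages and column Ns shown above. The counts are rounded from the published percentages, so they are an assumption and may differ slightly from the weighted counts SPSS reports. The function names are hypothetical.

```python
# "Has home computer" percentage and column N from the percentage table,
# keyed by (education, hispanic, gender).
HAS_PC = {
    ("BA+", "yes", "male"): (92.9, 28),
    ("BA+", "yes", "female"): (83.3, 42),
    ("BA+", "no", "male"): (85.9, 708),
    ("BA+", "no", "female"): (81.3, 727),
    ("JC-", "yes", "male"): (49.5, 194),
    ("JC-", "yes", "female"): (43.2, 222),
    ("JC-", "no", "male"): (52.8, 1874),
    ("JC-", "no", "female"): (53.9, 2397),
}


def approximate_counts(table):
    """Return {(education, hispanic, gender, homecomp): count}.

    Counts are rounded from percentages, so they are approximations.
    """
    counts = {}
    for (educ, hisp, gender), (pct, n) in table.items():
        has_pc = round(pct / 100 * n)
        counts[(educ, hisp, gender, "has PC")] = has_pc
        counts[(educ, hisp, gender, "no PC")] = n - has_pc
    return counts


def pc_odds(counts, educ):
    """Odds of having a home computer within one education group."""
    has_pc = sum(v for k, v in counts.items() if k[0] == educ and k[3] == "has PC")
    no_pc = sum(v for k, v in counts.items() if k[0] == educ and k[3] == "no PC")
    return has_pc / no_pc


counts = approximate_counts(HAS_PC)
print("Total cases:", sum(counts.values()))
print("Odds ratio, BA+ versus JC-:", pc_odds(counts, "BA+") / pc_odds(counts, "JC-"))
```

Comparing odds this way is only a descriptive check on the percentages; the loglinear models in the assignment are what test whether such differences hold up once the other variables are taken into account.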


 

 

ASSIGNMENT GENERAL SPECIFICATIONS AND THE SPSS MODEL SELECTION PROGRAM

The 1999, 2002 and 2006 samples of the NSF Surveys of Public Understanding of Science and Technology were telephone random digit dial surveys with random selection within households.* These are probability samples of the lower 48 United States.

*(To be compatible with the earlier data, the 2006 data use the cell and landline subsample [~95%] of the face to face surveys.)

This exercise uses the "Model Selection" program in SPSS. This "HILOG" program (short for HIerarchical LOGlinear) estimates ONLY hierarchical loglinear models. With the Model Selection or "HILOG" program, you specify only the higher order terms and the program fills in all the lower order terms they imply.
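The "fill in the rest" rule can be illustrated with a short Python sketch. It is not SPSS code, and `implied_terms` is a hypothetical helper name: given the highest order terms you specify, it lists every lower order term that a hierarchical model must also contain.

```python
from itertools import combinations


def implied_terms(generators):
    """List every term implied by a hierarchical model's highest order terms.

    Each generator is a tuple of variable names, e.g. ("GENDER", "HOMECOMP").
    A hierarchical model containing a term also contains all of its subsets.
    """
    terms = set()
    for generator in generators:
        ordered = tuple(sorted(generator))
        for size in range(1, len(ordered) + 1):
            terms.update(combinations(ordered, size))
    # Sort by order (main effects first), then alphabetically.
    return sorted(terms, key=lambda term: (len(term), term))


# One three-way interaction plus a separate main effect for HISPANIC.
model = implied_terms([("GENDER", "HOMECOMP", "REDUC2"), ("HISPANIC",)])
for term in model:
    print("*".join(term))
```

Specifying the single four-way term in the same way yields all 15 terms of the saturated model.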

The HILOG program provides the numeric lambda parameters of the general cell frequency equation only for the fully saturated model. It also does not estimate the grand mean (equiprobable model) term. Whichever of the loglinear, ordinal, or logistic regression packages you use, now or later, you really need to read the output VERY carefully to ensure that you estimated the model that you thought you did and to understand the output terms.

In this exercise you will evaluate a four variable model. You will interpret statistics from the saturated model. Then you will examine the diverse backwards elimination models to estimate the best model to use with the data. There will be some divergence between what the computer program will tell you is the best fitting model under the default conditions and some of our class material. Be sure to go with the in-class material in this case! What the program will do is use a "data dredger" mechanical means to make its decision, and we hope that human judgment will take more of the nuances into account. For example, we use a 0.20 probability cutoff and the program will use a 0.05 cutoff for its default. (NOTE: On the front of the Model Selection menu, change the "probability for removal" from 0.05 to 0.20!) Under the 0.05 default, some models that may "underfit" when reproducing the observed table could "slide through" and meet the SPSS criteria for a good model because they have probability levels of over 0.05 but less than 0.20.
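The difference between the two cutoffs can be seen in a small Python sketch. This is an illustration only, not SPSS output: `chi2_sf` is a hypothetical helper that computes the upper-tail probability of a chi-squared value for whole-number degrees of freedom, and the G2 value of 2.5 is made up for the example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared value x with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # closed form for df = 1
    # Step up two degrees of freedom at a time until df is reached.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def acceptable_fit(g2, df, cutoff):
    """A model 'fits' when its probability level exceeds the cutoff."""
    return chi2_sf(g2, df) > cutoff


# Hypothetical model: G2 = 2.5 on 1 degree of freedom.
p = chi2_sf(2.5, 1)
print("p =", round(p, 4))
print("Passes SPSS default (0.05):", acceptable_fit(2.5, 1, 0.05))
print("Passes class criterion (0.20):", acceptable_fit(2.5, 1, 0.20))
```

A model like this one would be accepted under the 0.05 default but rejected under the in-class 0.20 criterion, which is exactly the "slide through" case described above.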

You will examine the likelihood ratio Chi-squared statistics (G2), the degrees of freedom, probability levels and the Z-values, the standardized scores that correspond to the numeric lambda estimates for the saturated model. All of these results should give you some clues about the best model for the data. In the process you will evaluate several different models for the data in the questions at the end of this page.

You will also write out the numeric general cell frequency equation using the lambdas for the saturated loglinear model that describes these four variables. Then, you will interpret the findings in this 4-way table in words. The percentage tables above should help you with the verbal descriptions.
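As a reminder of the general shape of that equation, here is the saturated four variable model in symbols. This is a sketch under assumptions: it uses the additive (log) form, writes the expected cell frequency as F, and labels the variables G (GENDER), H (HISPANIC), E (REDUC2) and C (HOMECOMP). Follow the notation from class and the Exercise 1 feedback if it differs.

```latex
\ln F_{ijkl} = \theta
  + \lambda^{G}_{i} + \lambda^{H}_{j} + \lambda^{E}_{k} + \lambda^{C}_{l}
  + \lambda^{GH}_{ij} + \lambda^{GE}_{ik} + \lambda^{GC}_{il}
  + \lambda^{HE}_{jk} + \lambda^{HC}_{jl} + \lambda^{EC}_{kl}
  + \lambda^{GHE}_{ijk} + \lambda^{GHC}_{ijl} + \lambda^{GEC}_{ikl} + \lambda^{HEC}_{jkl}
  + \lambda^{GHEC}_{ijkl}
```

The equation has one grand mean term, 4 marginal effects, 6 two-way associations, 4 three-way interactions and 1 four-way interaction.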
 


RUNNING YOUR SPSS HIERARCHICAL LOGLINEAR PROGRAM

Open the SPSS program.
In the initial window that opens in the middle of the page, you will need to click "OK".
When the "Open File" window opens in the middle of the page, you will need to navigate to the CD-ROM drive.
Then double click on the CD-ROM window to load the BigDigitalC8306.sav file on your CD into SPSS.
This is a big file so give SPSS a minute to load all the data.

Under the Analyze menu at the top of the page, select Descriptive Statistics, then select Frequencies.
Select the variables: GENDER, HISPANIC, HOMECOMP and REDUC2.
Click OK to run the frequencies for these four variables.
Make sure that all the missing data is, in fact, coded as missing (including any system missing "sysmis" values) and that the percentages shown are ONLY for valid cases.
There are a total of 6192 valid WEIGHTED cases (6187 unweighted cases). These are available for the years 1999, 2002 and 2006. Glance at the bottom of the SPSS data viewer screen to make sure that it says "weight on".

(Ethnicity is only available from 1999 on; by including HISPANIC in the analysis, that is what also selects the three data years we are analyzing.)

Now, print the Frequencies tables for your output to turn in to me.

Under the Analyze menu at the top of the page, select Loglinear, then select Model Selection...

For factors, select REDUC2 (range 1-2), HISPANIC (range 1-2), GENDER (range 1-2), and HOMECOMP (range 1-2). You must click on the Define Range box each time you enter a variable, and enter the minimum and maximum values for each variable.

Click on Model.
You will start with the saturated model already chosen for you, so click Continue.

Click on the Options... button.
Left click to select Parameter estimates and Association table.
Click Continue.
Remember to change the probability for removal on the front menu to 0.20.
Now click OK to run the program. 



EXAMINING YOUR OUTPUT

Well, running the computer program is one thing, but being able to view ALL of your output is something else again. If you use either SPSS 15 or later, or SPSS 6.1.3, all your output should be present and easily visible. I do not recommend other versions of SPSS earlier than 15, which can present complications in viewing the output.

Be sure to print your loglinear "Model Selection" program output to turn in to me with your assignment.

The Model Selection Program gives you a diversity of output.

"Tests that K-way and higher order effects are zero" is a BACKWARDS elimination algorithm. (HILOG does not do a forward selection algorithm.) This is a set of HIERARCHICAL TESTS.

The "Tests that K-way effects are zero" is NOT a forward algorithm. It is actually a non-hierarchical set of tests, e.g., it eliminates all the marginal effects but preserves the two way associations and the interaction terms. This is kind of a strange option and one which may not be very useful to you (unless you have, perhaps, equal cell size experiments or other non-hierarchical designs).
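The degrees of freedom you should expect in the two "K-way" sections can be worked out by counting terms. The Python sketch below is an illustration, not SPSS output, and `kway_df` is a hypothetical helper. It relies on the fact that every variable in this exercise has two categories, so each term contributes exactly one parameter.

```python
from math import comb


def kway_df(n_vars=4):
    """Degrees of freedom for the K-way tests when every variable is dichotomous.

    Returns two dicts keyed by K:
      per_order  - df for "K-way effects are zero" (terms of order K only)
      and_higher - df for "K-way and higher order effects are zero"
    """
    per_order = {k: comb(n_vars, k) for k in range(1, n_vars + 1)}
    and_higher = {
        k: sum(per_order[j] for j in range(k, n_vars + 1)) for k in per_order
    }
    return per_order, and_higher


per_order, and_higher = kway_df(4)
for k in sorted(per_order):
    print(f"K = {k}: K-way df = {per_order[k]}, K-way and higher df = {and_higher[k]}")
```

For example, deleting the four-way term and all four three-way terms leaves 1 + 4 = 5 degrees of freedom for the model that keeps only the two-way associations.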

With the "Tests of PARTIAL associations", HILOG eliminates one term at a time and assesses the result. This section is not hierarchical either, but both this and some aspects of the "K-way effects" non-hierarchical sections are very helpful in deciding which, if any, effects can be eliminated from your model.
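For readers who want to see what "eliminate one term and assess the result" means computationally, here is a rough Python sketch using iterative proportional fitting. It rests on several assumptions: the cell counts are rounded from the percentage table above (so they only approximate the weighted SPSS counts), the partial test for a two-way term is taken to be the change in G2 when that term is removed from the model containing all two-way terms, and the function names are hypothetical. It is not a substitute for your HILOG output.

```python
from itertools import combinations
from math import log

VARS = ("REDUC2", "HISPANIC", "GENDER", "HOMECOMP")

# Approximate counts, keyed by (REDUC2, HISPANIC, GENDER, HOMECOMP), coded 1 or 2.
OBSERVED = {
    (1, 1, 1, 1): 26, (1, 1, 1, 2): 2, (1, 1, 2, 1): 35, (1, 1, 2, 2): 7,
    (1, 2, 1, 1): 608, (1, 2, 1, 2): 100, (1, 2, 2, 1): 591, (1, 2, 2, 2): 136,
    (2, 1, 1, 1): 96, (2, 1, 1, 2): 98, (2, 1, 2, 1): 96, (2, 1, 2, 2): 126,
    (2, 2, 1, 1): 989, (2, 2, 1, 2): 885, (2, 2, 2, 1): 1292, (2, 2, 2, 2): 1105,
}


def margin(table, axes):
    """Sum a cell table over every variable not listed in axes."""
    totals = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        totals[key] = totals.get(key, 0.0) + value
    return totals


def fit_ipf(observed, generators, iterations=50):
    """Expected counts for a hierarchical model via iterative proportional fitting."""
    fitted = {cell: 1.0 for cell in observed}
    axes_list = [tuple(VARS.index(v) for v in gen) for gen in generators]
    for _ in range(iterations):
        for axes in axes_list:
            observed_margin = margin(observed, axes)
            fitted_margin = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= observed_margin[key] / fitted_margin[key]
    return fitted


def g2(observed, fitted):
    """Likelihood ratio chi-squared statistic."""
    return 2 * sum(o * log(o / fitted[c]) for c, o in observed.items() if o > 0)


ALL_TWO_WAY = list(combinations(VARS, 2))


def partial_g2(term):
    """Change in G2 when one two-way term is dropped from the all two-way model."""
    reduced = [gen for gen in ALL_TWO_WAY if gen != term]
    full_fit = g2(OBSERVED, fit_ipf(OBSERVED, ALL_TWO_WAY))
    reduced_fit = g2(OBSERVED, fit_ipf(OBSERVED, reduced))
    return reduced_fit - full_fit


for term in ALL_TWO_WAY:
    print("*".join(term), round(partial_g2(term), 2))
```

Each dropped two-way term frees one degree of freedom here, so a large change in G2 signals a term that should stay in the model.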

The "Estimates for Parameters" section is also very helpful. Not only will it give you the loglinear parameters you need to write the numeric equation for the saturated model (ONLY) (see the example in the Exercise 1 Feedback) but the standardized Z-values can assist you in selecting the most parsimonious model that can still explain the data. Pay attention to Z-values with an absolute value of 1.50 or larger. These parameters generally cannot be dropped.
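The Z-value rule of thumb can be sketched in a few lines of Python. The parameter names and numbers below are hypothetical and are NOT taken from the assignment output; the sketch only shows how a Z-value (the estimate divided by its standard error) is compared against the 1.50 guideline.

```python
def flag_parameters(estimates, cutoff=1.50):
    """Split parameters into those to keep (|Z| >= cutoff) and candidates to drop.

    estimates maps a parameter label to (lambda estimate, standard error).
    """
    keep, maybe_drop = [], []
    for name, (coefficient, std_error) in estimates.items():
        z = coefficient / std_error
        if abs(z) >= cutoff:
            keep.append(name)
        else:
            maybe_drop.append(name)
    return keep, maybe_drop


# Hypothetical values for illustration only.
example = {
    "A*B": (0.30, 0.10),   # Z is about 3.0
    "A*C": (0.02, 0.05),   # Z = 0.4
    "B*C": (-0.12, 0.06),  # Z = -2.0
}

keep, maybe_drop = flag_parameters(example)
print("Keep:", keep)
print("Candidates to drop:", maybe_drop)
```

The sign of Z does not matter for this decision, only its absolute size, and a hierarchical model must still retain any lower order term that a kept higher order term implies.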
 


ASSIGNMENT QUESTIONS

1. Your SPSS FREQUENCIES output and your MODEL SELECTION ("HILOG") output

(2 points) Although your output does not have a large weight for the exercise, you must turn it in. That way, if you have made any mistakes on the rest of the assignment, I can check these back against your output to give you more credit for understanding the material.

PLUS YOUR ANSWERS TO QUESTIONS 2 - 10 BELOW:

2. (2 points) What are the G2, degrees of freedom and p-level for the hierarchical model that incorporates all the three-way interaction effects?

3. (2 points) Do your results suggest that the four variable interaction term can be deleted from a well-fitting model?
BRIEFLY, give the rationale behind your decision of why or why not.

4. (2 points) Based on the results, does it appear that any of the three-way interaction terms must be retained in order for the model to fit well?
If so, which interaction term(s) must be retained? (List all that apply.)
BRIEFLY give the rationale behind your decision.

5. (2 points) Based on the results, does it appear that any of the two-way association terms can be dropped from the model and yet the model will still fit the data well?
If so, which two way association(s) could be dropped? (List all that apply.)
BRIEFLY give the rationale behind your decision.

6. (2 points) Using the Z-score values associated with each parameter, which parameters look like they may be dropped and yet the model will still fit well? (List all that apply.)

7. (4 points) Use either text or class terminology to describe the model that you believe has the best fit.
Choose among the models in your output only.

How many degrees of freedom are in this model?
SHOW how you obtained the degrees of freedom.
What was the G2 for this model?
What was the p-level for the model you selected?
Briefly describe the rationale for your choice of this model and the results that support it.
PLEASE USE THE IN-CLASS CRITERIA (p > .20) FOR USING P-LEVELS TO SUPPORT A FINAL MODEL.

8. (2 points) Using your results, write out the loglinear equation WITH NUMBERS for the saturated model.
(NOTE: HILOG does not give the grand mean or θ effect. For this assignment, it is OK to either put the Greek letter theta as a placeholder or simply to eliminate it. Review the Exercise 1 feedback again if needed for an example of a loglinear equation with numbers.)

Use the lambdas from the "parameter estimates" section of your output.

Be sure to label the variables in your equation. You can assign them the letters A, B, C and D as long as you provide the variable names that accompany each of the letters. You can also assign the variables descriptive letters, e.g., G, E, H or PC.

9. (1 point) Using the symbols (i.e., θ and λ), write out the loglinear equation for the model you believe has the best fit.

(Recall that the HILOG program only generates the loglinear equation for the saturated model, although it will generate the degrees of freedom and G2 for a wide variety of hierarchical loglinear models. Therefore you can't use the parameter numbers if you have dropped any terms at all, since these will change somewhat with each new model. However, you CAN use the correct lambdas with superscripts and/or subscripts.)

10. (1 point) IN WORDS, briefly describe the results as implied by your best fitting model. This means talking about the associations and possible interactions among the variables in words, not presenting numeric loglinear results or symbols. Imagine that you are describing the results in a non-technical fashion to a colleague or at a conference.
 
 

Susan Carol Losh
February 25 2009