
Please see the top of this guide for more updated odds and ends!
 

 
 

EDF 6937-01       SPRING 2009
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 8: ON LOGITS, LAMBDA, and OTHER "GENERAL" THOUGHTS
EXERCISES PAPERS ETC ETC
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
UPDATES
GENERAL THOUGHTS
MULTINOMIAL EQUATIONS
ON CAUSALITY
INTERPRETATION

 
UPDATES FROM APRIL 16th

This is a NEW section, incorporating information from "trial runs," explanations, and class projects. Please read through carefully.

First, thanks to Mike, Brandon, Nate, Song-Il and Dan, for sharing their data, frustrating SPSS experiences, and findings with us. It may not seem like it to you right now, but by pinpointing issues, this was a BIG help.

If you need a refresher course in why we are going to all this trouble to begin with, see Guide 1.

Meanwhile, please look further down this page to the MULTINOMIAL EQUATIONS section and study the examples carefully. There are two examples.

The causal model is that gender influences degree level which in turn influences watching science television (these data are from the 2001 NSF Surveys because the SCITV questions were not asked in 2006).

Now, you could do this analysis all at one time in the GENERAL SPSS program (and if I were doing this for a conference or publication, I would). But, for the sake of exercise, we will do TWO logistic regression equations.

In the first example equation, you are predicting four levels of highest degree from gender using multinomial regression. You'll have three equations (k - 1), for (default) the first three levels of degree with graduate work as the suppressed and reference category.

In the second equation, you'll use binomial regression to predict watching science television (1 = yes and 0 = otherwise) from degree level and gender, because SCITV has only two categories. However, you will still have three equations because one of your independent variables, DEGLEV4, has four categories, which yields k - 1 = 3 equations. The coefficients for the constant (the SCITV marginal) and gender will be the same in all three equations. However, there will be three, possibly different, coefficients for degree level.


To obtain statistics for the various logistic regression programs, historically SPSS first creates a multi-way table comparable to what you see in the GENERAL or MODEL SELECTION programs and calculates the lambdas. It then takes the logits using your chosen dependent variable and turns the lambdas into betas.

If there's any problem in the GENERAL program, unfortunately it appears this will translate into the logistic regression program estimates.

In the latest revision of the GENERAL program, you must enter terms from the simple to the more complex. This means when you create a custom model, you first enter the marginal terms, then desired two-way association terms, then three-way terms and so forth. Although your G2s will be fine, very strange things will happen to your lambda coefficients if you enter terms into the model in any other order.
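One way to keep yourself honest about the simple-to-complex rule is to sort your list of terms by the number of variables in each before typing them into the custom model. The sketch below is only an illustration: the term labels (G, DL, TV joined by `*`) are hypothetical shorthand, not SPSS syntax, and the helper names are made up for this example.

```python
def term_order(term):
    """Number of variables in a term, e.g. 'G*DL' -> 2."""
    return len(term.split("*"))


def simple_to_complex(terms):
    """Return terms sorted from marginals up to higher-order interactions.

    sorted() is stable, so terms of the same order keep their original
    relative position.
    """
    return sorted(terms, key=term_order)


# Hypothetical term labels (not SPSS syntax): G = gender, DL = degree level,
# TV = watches science television.
requested = ["G*DL", "TV", "DL*TV", "G", "G*TV", "DL"]
ordered = simple_to_complex(requested)
print(ordered)
```

The marginals come out first, followed by the two-way terms, which is the order in which you would enter them.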

Apparently, in the logistic regression programs in the current SPSS version (15, and I suspect 16), you must follow the same simple-to-complex rule when entering your desired terms as you do for the latest version of the GENERAL program (don't worry about terms that contain only the independent variables; you won't see these at all). I didn't catch this before because that's how I typically add the terms in the customized model.
 


As you have learned, the concept of an odds-ratio, including a logged odds-ratio, is not very intuitive.
Further, the LOGGED odds-ratio is symmetric around zero and is negative for fractional odds (odds below one). If you have very small fractional odds, you will have a very large negative effect for the logit.

In contrast, the odds-ratio for a fraction is sandwiched between zero and one. You will get a much better picture of how your independent variables influence your dependent variable if you use the logit (logged odds) equation rather than exponentiating the betas, because the positive and negative betas will be symmetric in the logits.
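A quick numeric check makes the symmetry concrete. The sketch below takes a few illustrative odds values (chosen for this example, not taken from the NSF data) together with their reciprocals and logs them. The fractional odds are all squeezed between zero and one, but their logits mirror the logits of the odds above one.

```python
import math

# Each pair holds an odds value and its reciprocal (illustrative values).
odds_pairs = [(4.0, 0.25), (10.0, 0.1), (100.0, 0.01)]

# Natural logs of each pair: these are the logits.
logged = [(math.log(high), math.log(low)) for high, low in odds_pairs]

for (high, low), (log_high, log_low) in zip(odds_pairs, logged):
    print(f"odds {high:g} -> logit {log_high:+.3f} | odds {low:g} -> logit {log_low:+.3f}")
```

Each line of output shows two logits with the same size and opposite signs, which is why effects are easier to compare on the logit scale.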

In binary logistic regression, your dependent variable should be coded 1 (high or a success) or 0 (low or everything else).
If you don't code your dependent variable this way, the program will code your first value as 0 and your second (in numeric order) as 1.
This is the opposite of how the MODEL SELECTION and GENERAL programs code binary variables. It's also the opposite of how the binary logistic regression program will code your independent variables.

In the loglinear model, the programs give you the coefficients for the first value of a binary variable and suppress the second value.
Binary logistic regression will treat your independent variables this way, but treats the dependent variable as you see above.

Typically at the beginning of the logistic regression program output, it will tell you how it coded the variables. Hold on to that output and don't delete it when you turn your output in with your paper! Study that information carefully as you interpret your results.

Otherwise you will have difficulty interpreting your results and so will your readers.
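To see why the coding matters, here is a small sketch with made-up data. It assumes a hypothetical raw variable coded 1 = watches science TV and 2 = does not, and contrasts what happens under the default rule described above (lower value becomes 0, higher value becomes 1) with an explicit recode. The variable names are invented for the illustration.

```python
# Hypothetical raw codes: 1 = watches science TV, 2 = does not.
raw_scitv = [1, 2, 2, 1, 2]

# Left alone, a program that maps the lower value to 0 and the higher to 1
# would treat "does not watch" as the outcome being predicted.
lowest = min(raw_scitv)
default_coding = [0 if value == lowest else 1 for value in raw_scitv]

# Explicit recode so that 1 = yes (a success) and 0 = everything else.
recoded = [1 if value == 1 else 0 for value in raw_scitv]

print("default:", default_coding)
print("recoded:", recoded)
```

The two codings are mirror images, so every coefficient would flip sign depending on which one the program used.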


They don't, really, although the results will LOOK different. If the G2 increases sizably over the G2 that included the terms of interest (e.g., all three-way interactions), we say the model "doesn't fit" and those terms must be included or "put back" to make the expected and observed counts in the multi-way table coincide within sampling error. Returning these terms to the model lowers the G2 statistic, hence "small Chi-squares are good." They mean the model fits.

The saturated model and the MODEL SELECTION program will give you a good idea of which terms are necessary to have a model that fits, i.e., produces a small G2.

The saturated MODEL SELECTION program is a real data drudger. It will tell you everything about how your variables fit together, and that's why I recommend running it first.

And, by the way, if you don't include particular terms, the logistic regression programs, unlike MODEL SELECTION or GENERAL, won't tell you that you need them (another reason to run the saturated model first in MODEL SELECTION, to be sure that you included everything important). That's because of how the logistic regression programs proceed.

The logistic regression programs will first fit the following terms in an underlying loglinear model. You will NOT see these terms in logistic regression because they only address the independent variables and these terms drop out through subtraction as you move to a logit or logistic regression model (review Guide 6 on the algebraic transition from loglinear lambdas to logistic regression betas). These terms are:

  1. All marginals for the INDEPENDENT variables
  2. All two-way associations between INDEPENDENT variables only
  3. All three-way and higher interactions among INDEPENDENT variables only
Next, the program OMITS the terms that you specified between the independent variables and the DEPENDENT variables. (In my first example below, that would be the two-way association between gender and degree level). It recalculates the multi-way table of expected cell counts and compares this with the observed table that incorporates the terms that you specified.

Over the course of a rather messy calculation formula, you will have a new G2 for your logistic regression model. This G2 reflects the chi-square difference between the model omitting the terms you specified and the model that incorporates those terms. Similarly, the degrees of freedom in this model will reflect the difference between the model omitting the terms and the model that incorporates those terms. The df typically corresponds to the number of terms that you specified in your logistic regression model (including extra terms for a polytomous variable, i.e., the k - 1 values for each of your independent variables). If this new G2 is statistically significant, it means the model does NOT fit without the terms you specified in it, and those terms must be returned to make the model "fit". This is considered the overall chi-square for your model.

Want to reproduce this G2? Go into the MODEL SELECTION program and run it with the terms included. Then leave those terms out and run the model through the MODEL SELECTION program again. The increase in the G2 should be roughly the same as the G2 you found in the logistic regression package. Just be sure to include in the MODEL SELECTION program run the terms that only address the independent variables to make the program specifications roughly the same as the logistic regression runs (see above).
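The same comparison can be sketched by hand for the simplest case, a two-way table of gender by degree level. The counts below are invented for illustration (they are NOT the NSF data), and the function names are made up for this example. The model with the gender-by-degree term is saturated for a two-way table, so its G2 is zero; the model without the term uses only the two marginals.

```python
import math

# Hypothetical counts: rows = gender (male, female),
# columns = four degree levels. Not the NSF data.
observed = [
    [300, 60, 90, 50],
    [380, 70, 80, 40],
]


def independence_expected(table):
    """Expected counts from the row and column marginals only."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def g_squared(obs, exp):
    """Likelihood-ratio chi-square: 2 * sum(O * ln(O / E)), skipping empty cells."""
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )


# Model WITH the gender-by-degree term: saturated for a two-way table,
# so expected counts equal observed counts and G2 is 0.
g2_with = g_squared(observed, observed)

# Model WITHOUT the term: gender and degree marginals only.
g2_without = g_squared(observed, independence_expected(observed))

g2_change = g2_without - g2_with
df_change = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"G2 change = {g2_change:.3f} on {df_change} df")
```

The increase in G2 when the term is dropped plays the role of the overall chi-square for the model, and the 3 degrees of freedom match the k - 1 = 3 coefficients for a four-category variable crossed with a two-category one.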
 

GENERAL THOUGHTS AND LOGISTICS (no, NOT logits)

 

Once again a nice bunch of exercises that show good basic knowledge. Everyone is at the A- or A level overall on the exercise total.

Do you sometimes feel at sea? Not sure what to call those pesky coefficients? The why is simple. This is new material. Math is a language. You would not expect to take a semester of Spanish, Korean, Turkish, or English and speak like a native. You would feel a bit hesitant, perhaps worry a little about embarrassing yourself with a grammatical error. The individuals you speak with, on the other hand, are so delighted that you are trying to speak their language that they easily forgive you a slip of the tongue or two.

I found a few "sticky wickets" and enumerate them below.

When your output gives you lambdas, you must convert them to betas (subtraction) and then exponentiate back up to get odds-ratios. When your output gives you betas, all you need to do is exponentiate (the right beta) back up to get the odds-ratio, because betas are logged odds-ratios.
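As a rough sketch of the two steps, suppose a two-category dependent variable D and one category of a predictor X. The lambda values below are hypothetical, chosen so that each pair sums to zero; they do not come from any output in this guide, and the variable names are made up for the example.

```python
import math

# Hypothetical lambdas, keyed by the category of the dependent variable D.
lambda_d = {1: 0.30, 2: -0.30}    # marginal term for D
lambda_dx = {1: 0.20, 2: -0.20}   # D-by-X association term

# Step 1 (subtraction): each beta is the lambda for D = 1 minus
# the lambda for D = 2.
beta_constant = lambda_d[1] - lambda_d[2]
beta_x = lambda_dx[1] - lambda_dx[2]

# Step 2 (exponentiate): the beta is a logged odds-ratio.
odds_ratio_x = math.exp(beta_x)

print(f"beta for X = {beta_x:.2f}, odds-ratio = {odds_ratio_x:.3f}")
```

If your output already reports betas, only the second step applies.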
MULTINOMIAL EQUATIONS

On multiple categories for a variable. If a variable in your model has k categories (k is at least 2), you will need k - 1 equations. The kth equation is obtained by subtraction with the requirement that the sum of the lambda coefficients across rows, columns, and combinations of rows and columns must equal 0. For any equation with deglev4, for example, that means a set of three equations to describe what is happening.
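The "obtained by subtraction" step is just arithmetic on the sum-to-zero requirement. In the sketch below the three lambdas are hypothetical values for the first three degree levels, and `complete_lambdas` is a helper name invented for this illustration.

```python
def complete_lambdas(first_k_minus_1):
    """Append the kth lambda so that the full set sums to zero."""
    return list(first_k_minus_1) + [-sum(first_k_minus_1)]


# Hypothetical lambdas for the first three of four degree levels.
lambdas = complete_lambdas([0.9, -0.1, -0.3])
print(lambdas)
```

The fourth value is whatever is needed to bring the total back to zero, which is why the program only has to report k - 1 of them.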

For example, to predict four levels of education (deglev4) from gender, you need separate equations for the high school, two-year college, and BA levels. These data are for 2001 from the NSF Surveys of Public Understanding of Science and Technology.

DL1 = 2.408 DL + (-.411) G*DL
DL2 = 0.821 DL + (-.338) G*DL
DL3 = 0.685 DL + (-.206) G*DL

NOTE: The constant terms tell us that as we go up the educational ladder, the marginal for degree level diminishes, and that females (the second category) are less likely to go on to obtain college degrees (the second, third and fourth categories). Traditionally (and still) from the twentieth century on, women have been more likely to graduate high school. In very recent years, women have earned the majority of AA and BA degrees, but surveys of the adult US population obviously include individuals much older than those from recent generations.
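If you want to move from the logit scale back to odds, you exponentiate each coefficient. The sketch below does that for the coefficients copied from the three degree-level equations. How you read each exponentiated value depends on how the program coded gender and which degree level is the reference category, so check the coding table in your own output before interpreting them.

```python
import math

# Coefficients copied from the three degree-level equations:
# (constant, gender-by-degree coefficient).
equations = {
    "DL1": (2.408, -0.411),
    "DL2": (0.821, -0.338),
    "DL3": (0.685, -0.206),
}

# Exponentiate each logged value to return to the odds scale.
exponentiated = {
    name: (math.exp(constant), math.exp(gender_coef))
    for name, (constant, gender_coef) in equations.items()
}

for name, (odds, odds_ratio) in exponentiated.items():
    print(f"{name}: exp(constant) = {odds:.3f}, exp(gender term) = {odds_ratio:.3f}")
```

Every negative coefficient becomes a value between zero and one, and every positive coefficient becomes a value above one.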

Here are the equations for watching science television in 2001. Again, three equations are needed to predict watching science TV (not available in the 2006 data) from degree level (DL) and gender (G). The three-way interaction was not needed in this model, so the results use the more modest model with all two-way terms. Watching science TV is coded 1 = yes and 0 = no when I use the logistic regression programs. Gender is 1 for males and 2 for females and degree level is coded 1 through 4.

Although the dependent variable (watch science TV) is dichotomous, I still need three equations, because my independent variable degree level has four categories.

TV = -1.133 TV + .434 TV*DL1 + (-.151) TV*G
TV = -1.133 TV + .204 TV*DL2 + (-.151) TV*G
TV = -1.133 TV + (-.228) TV*DL3 + (-.151) TV*G

The constant (-1.133) remains the same because the dependent variable has only one independent value (k - 1 = 2 - 1 = 1). So does the coefficient for gender, which also has only two values (hence one coefficient).

Most people don't watch science TV, but males and those with more education are relatively more likely to watch.
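The "most people don't watch" reading follows from the negative constant: a logit below zero corresponds to a probability below one half. The sketch below applies the standard logistic conversion to the constant from the science TV equations. Exactly which respondents that constant describes depends on how the program coded the independent variables, so treat this as an illustration of the conversion only; `logit_to_probability` is a helper name made up for the example.

```python
import math


def logit_to_probability(logit):
    """Convert a logged odds (logit) to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


constant = -1.133  # constant from the science TV equations
p_watch = logit_to_probability(constant)
print(f"logit {constant} -> probability {p_watch:.3f}")
```

A logit of zero converts to exactly one half, so any negative logit lands on the "less likely than not" side.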

EXAMINING CAUSALITY IN THESE MODELS

The pattern of results and lower sex differences on watching science television when we take the differences in degree level into consideration are consistent with what we expect in a mediated relationship. The original independent-dependent variable relationship (gender -> TV) becomes smaller when a mediating (or intervening) variable is introduced into the analysis. This suggests that any causal effect that gender has on science television is in part causally indirect. Something about gender (perhaps socialization variables) leads to differential degree levels. In turn, degree level has a greater direct effect on science TV than gender, i.e., it is a more causally proximate variable.

We know, on the other hand, that this is not a spurious relationship. A spurious relationship occurs when the introduced control variable serves as the "real independent" variable and the relationship between the original independent variable and the dependent variable attenuates or becomes smaller.

How do we know this isn't a spurious relationship? Because causally, such a spurious relationship would be diagrammed as you see below:
 

PROPOSED SPURIOUS RELATIONSHIP CAUSAL ORDER

                      ----------> GENDER 
                     / 
                    /
DEGREE LEVEL 
                \
                 \-----------> WATCH SCIENCE TV

Unless you are willing to grant that your highest achieved level of education makes you male or female as well as contributing to watching science TV, this cannot be a spurious relationship.

Depending on tastes and criteria, either a final model included a causal direct and a causal indirect (through deglev4) effect for gender on science TV, or the analyst dropped the direct effect of gender on science TV.

INTERPRETATION

Whether it is an exercise or a conference paper or an article, you must tell your reader what the results mean.

What did your final model look like? What terms did it include?
You can use loglinear abbreviations to describe your final model and equations to show the terms.

What's the causal status of your final model? Did you have direct causal effects, indirect causal effects with mediators, interaction effects with moderators? A causal diagram helps your reader understand the original causal model and the final results you obtained.

Put your results into words. Who was more likely to own a home computer, men or women (or was there no sex difference)? People with advanced degrees or with high school degrees? Did gender continue to affect owning a computer once education or time was controlled? (If not, educational level mediated the gender-homepc correlation; see the diagram above for why this would not be a spurious relationship.)

Tell us why (again) we should care about the results.

Does this give us insight about who might sue McDonald's? About why multiple cluster logistic regression will add to our analytic tools? About who is more likely to smoke (so we can target anti-smoking ads, perhaps)? Who goes to alternative schools? Whatever you began with, the interpretation and discussion paper is the place to return to your research problem and tell your reader what happened and what your suggestions are for future research.
 
 
Susan Carol Losh
April 15 2009