EDF 6937-01       SPRING 2009
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 4: BASICS ON FITTING MODELS
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
MODEL TERMINOLOGY (INCLUDES HIERARCHICAL) | CHI SQUARES & CHI SQUARES | BASIC MODEL TESTING | MODEL EQUATIONS

A QUICK WAY TO DESCRIBE A MODEL

As models contain more and more variables, it becomes increasingly difficult to describe them by providing an "illustrative table". Four-variable tables are difficult enough to present. An example of a four-variable table is below. I added ethnicity (White/NonWhite) as a further control variable to the gender-education effects on the science question, and used column percentages instead of cell frequencies to make the data slightly easier to read. I have also aggregated the results for the years 1999, 2001 and 2006 because, as you can see, some of the cell sizes become small.

PLANETARY QUESTION BY GENDER BY EDUCATION BY ETHNICITY

NONWHITES ONLY

EDUCATIONAL LEVEL    SOME COLLEGE OR LESS           BA OR MORE
GENDER               MALE     FEMALE   ROW n        MALE     FEMALE   ROW n
EARTH AROUND SUN     71.0%    54.5%    542          92.9%    81.8%    151
EVERYTHING ELSE      29.0     45.5     331          7.1      18.2     22
TOTAL                100.0%   100.0%                100.0%   100.0%
Ns                   400      473      873          85       88       173

 

WHITES ONLY

EDUCATIONAL LEVEL    SOME COLLEGE OR LESS           BA OR MORE
GENDER               MALE     FEMALE   ROW n        MALE     FEMALE   ROW n
EARTH AROUND SUN     79.7%    64.2%    2243         94.1%    86.3%    968
EVERYTHING ELSE      20.3     35.8     908          5.9      13.7     104
TOTAL                100.0%   100.0%                100.0%   100.0%
Ns                   1419     1732     3151         547      525      1072

Source: NSF Surveys of Public Understanding of Science and Technology, 1999, 2001, and 2006. Directors: Jon D. Miller and Linda Kimmel; Opinion Research Corporation/MACRO; General Social Survey. available n = 5269.

Even when we use percentages instead of cell frequencies, the tables become difficult to read. Tables with five or more variables obscure more than they clarify.


So, what is helpful at this point is to have an easy and succinct terminology to describe models without needing to show a table for each postulated model.

There are two common terminologies.

The first was briefly introduced in Guide Three and uses brackets. Where A, B, C and D refer to particular variables, we can sum up a model with all effects as simply:

{ABCD}

if model {ABCD} is what is called a hierarchical model.
A model is hierarchical if all lower order terms are contained within the model abbreviation. For our four variable generic model, the hierarchical model {ABCD} would include:
 

n                                  fixing the case base
{A} {B} {C} {D}                    All marginal effects
{AB} {AC} {AD} {BC} {BD} {CD}      All two-way associations
{ABC} {ABD} {ACD} {BCD}            All three-way interactions
{ABCD}                             The four-variable interaction

As you can see, it is far easier to summarize a hierarchical model with just the highest term(s) in brackets.
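The expansion from the bracketed highest-order terms to the full list of implied terms can be sketched in a few lines of Python. This is only an illustration: the function name expand_hierarchical is hypothetical, and each variable is assumed to be a single letter.

```python
from itertools import combinations


def expand_hierarchical(generators):
    """List every term implied by the highest-order terms of a hierarchical model.

    Assumes each variable is a single letter, e.g. "ABC" means {ABC}.
    The case base n is implied as well but is not listed.
    """
    terms = set()
    for generator in generators:
        letters = sorted(generator)
        # Every non-empty subset of a bracketed term is a lower-order term.
        for size in range(1, len(letters) + 1):
            for subset in combinations(letters, size):
                terms.add("".join(subset))
    # Marginals first, then two-way terms, and so on; alphabetical within each order.
    return sorted(terms, key=lambda term: (len(term), term))


print(expand_hierarchical(["ABCD"]))
print(expand_hierarchical(["ABC", "ABD", "CD"]))
```

The first call lists the same 15 terms as the table above (plus the implied n); the second shows what a shorter abbreviation such as {ABC}{ABD}{CD} stands for.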
 


If we decide that the model that is the simplest and best describes the data is one that (to continue with the four variable case) includes only all three-way interactions and all lower terms, we can describe that hierarchical model as follows:

{ABC}{ABD}{ACD}{BCD}

This model would ensure that the modelled and observed three-way interactions, two-way associations, one variable marginals and n were identical, leaving the cells in the four way table to be functions of the modelled parameters for the expected data and, of course, to be the observed cell frequencies for the "real" data.

A hierarchical model that fixed the {ABC} and {ABD} three-way interactions as well as the two-way {CD} association would be represented by:

{ABC}{ABD}{CD}

Gilbert and some others use a related terminology with parentheses and asterisks instead. Using his terminology, the four-way hierarchical model would be:

(A*B*C*D)

The second hierarchical model with all three-way terms would be: (A*B*C)(A*B*D)(A*C*D)(B*C*D)

And the third, even simpler, hierarchical model would be: (A*B*C)(A*B*D)(C*D)

Either set of terms is generally recognized by loglinear analysts. 


NON HIERARCHICAL TABLES

When we have hierarchical models, the observed data (and typically the modelled data) generally depart from equiprobability on the two way associations and in the univariate marginals. Thus, the summated terminology above does a good and simple job for many tables.

However, there are several occasions when models are not hierarchical. In that case you must describe every modelled parameter in the table with your terminology (see the descriptive table above for all the terms you would include for a four-variable table). You should also use statistical programs (more on these in future guides) that allow you to model non-hierarchical tables, and you must be certain to provide every single term you plan to model in your computer program input. Very strange things happen with your computer output, especially with SPSS, if you don't include all the relevant terms.

Many of the logistic regression programs assume that your model is hierarchical. They won't tell you that, but that's the assumption behind the statistics they present and the degrees of freedom that they calculate.

Below are a couple of instances that could create non-hierarchical tables. In each case, these approximate what Gilbert calls "stratified sampling designs."

EXPERIMENTAL DESIGNS. It is very common in experimental designs to create treatment groups that are all the same size. In part, this dates back to before modern high speed computer programs when it was faster and easier to do hand calculations using Analysis of Variance if all the treatment groups were the same size.

For example, suppose your dependent variable was success rates at quitting smoking (cigarettes) using three treatment groups (nicotine gum, nicotine patch, or both) and a control group that did not receive a nicotine supplement. You also wanted to see if gender influenced quit rates, either alone or in conjunction with a nicotine supplement. Thus you created the following study design. The cell frequencies are the planned number of cases in each treatment group.
 
 

EXPERIMENTAL TREATMENT     MALE    FEMALE
NICOTINE GUM               60      60
NICOTINE PATCH             60      60
BOTH                       60      60
NO NICOTINE SUPPLEMENT     60      60

As you can see, the gender marginal, the treatment marginal and the gender by treatment association will all be equiprobable (and do not need to be modelled) because of the experimental design. However, higher order terms (perhaps a gender by treatment by cessation rate interaction effect) might NOT be equiprobable.


DISPROPORTIONATE SAMPLING DESIGNS. We typically select cases with disproportionate probabilities when we have some small subgroups and wish to have enough cases so that our inference tests have sufficient statistical power. Recall that any sample design in which each element has a known and non-zero chance of selection is a probability sample. Thus, disproportionate designs are very often probability samples. However, we have oversampled some groups and undersampled others compared with sampling probability proportionate to size.

For example, when we look at the science question X gender X degree level X ethnicity table, we can see that NonWhite respondents are far outnumbered by White respondents in the USA. Currently, Whites comprise a bit over 75 percent of survey respondents. The smaller numbers of NonWhite respondents can present analytic problems when we wish to further subdivide the table, e.g., by gender and degree level. Thus, we may wish to overselect African-Americans, Hispanic-Americans, and Asian-Americans, the groups most prevalent in the U.S. (keeping in mind that Hispanics could be of White, African or Asian descent, thus adding yet another dimension to our sampling design).

One example could be:

PROPOSED DISPROPORTIONATE SAMPLE SCHEME

EDUCATIONAL LEVEL    SOME COLLEGE OR LESS      BA OR MORE
GENDER               MALE     FEMALE           MALE     FEMALE
WHITES               300      300              300      300
NONWHITES            300      300              300      300

Because of the sampling scheme in the table above, there will be no association between gender and ethnicity, between gender and degree level, or between ethnicity and degree level in the observed tables. However, there could be higher order interactions among gender, ethnicity, degree level and a dependent variable, such as the planetary question. Hence, this is a non-hierarchical model.
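A quick way to see why the design variables show no association is to collapse the planned scheme to each pair of variables and compute the odds-ratio. The sketch below uses the planned 300 cases per cell from the table above; the function name marginal_odds_ratio is hypothetical.

```python
from itertools import product

ethnicities = ["White", "NonWhite"]
genders = ["Male", "Female"]
degrees = ["Some college or less", "BA or more"]

# Planned design: 300 cases in every ethnicity x gender x degree cell.
planned = {cell: 300 for cell in product(ethnicities, genders, degrees)}


def marginal_odds_ratio(counts, first, second):
    """Odds-ratio for the 2 x 2 table formed by two design variables.

    first and second are positions (0, 1 or 2) within each cell key;
    the remaining variable is summed over.
    """
    table = {}
    for cell, n in counts.items():
        key = (cell[first], cell[second])
        table[key] = table.get(key, 0) + n
    a0, a1 = sorted({key[0] for key in table})
    b0, b1 = sorted({key[1] for key in table})
    return (table[(a0, b0)] * table[(a1, b1)]) / (table[(a0, b1)] * table[(a1, b0)])


print(marginal_odds_ratio(planned, 0, 1))  # ethnicity by gender
print(marginal_odds_ratio(planned, 0, 2))  # ethnicity by degree level
print(marginal_odds_ratio(planned, 1, 2))  # gender by degree level
```

Each odds-ratio equals 1 by construction, so the two-way terms among the design variables carry no information, while terms involving the dependent variable remain free to depart from 1.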

Hopefully no one has any problems seeing why the planetary question would be the dependent variable. But if you do have problems here, please review the material on causality in Guide 1. 


PEARSON CHI-SQUARES AND (LOG) LIKELIHOOD-RATIO CHI-SQUARES

One of the big advantages of loglinear models over more traditional ways of examining multivariate tables is the use of tests of statistical significance to test whether interaction effects and partial association (controlling other variables) effects are zero or are greater than zero. Being able to use these multivariate tests of statistical significance is one of the features that turns loglinear modelling into a system, comparable to N-way analysis of variance or multiple regression instead of the physical control and inspection of separate partial crosstab tables.

In a later guide we will see that we can also ascertain whether certain effects can be dropped or should be retained by examining the specific Z-scores related to effect parameters.

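The formula for the traditional Pearson Chi-Square statistic, the same formula presented in Guide 3, is repeated below. Notice that this is a multiplicative formula because in each term, we divide by the modelled or expected frequency for a particular cell.

$$X^2 = \sum_{i,j} \frac{(x_{ij} - m_{ij})^2}{m_{ij}}$$

Where $x_{ij}$ is the OBSERVED cell frequency and $m_{ij}$ is the MODEL or EXPECTED cell frequency.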

Instead of the Pearson Chi-Square, testing in loglinear analysis uses the likelihood ratio Chi-Square statistic (sometimes called the log likelihood ratio statistic due to the logged terms in the formula). One version of the formula for the likelihood ratio Chi-square statistic is given immediately below:

$$G^2 = 2 \sum_{i,j} x_{ij} (\ln x_{ij} - \ln m_{ij})$$

Where $\ln x_{ij}$ is the natural log of the OBSERVED cell frequency.
Where $\ln m_{ij}$ is the natural log of the MODEL or EXPECTED cell frequency.

[NOTE: I sometimes call this L2 (that's another version).]  As you can see, the likelihood ratio chi-square statistic and the Pearson (multiplicative) chi-square statistic are relatives. In large (estimated over 100) samples, they both have a Chi-square distribution and can use the Chi-square tables. The calculated values are also quite similar in large (bigger than n = 100!) samples but not necessarily in small samples. 
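To see how close the two statistics are in practice, the sketch below computes both for a two-way table under the independence model. The cell counts are hypothetical (they are not the NSF data), and the function names are illustrative only.

```python
import math


def expected_under_independence(observed):
    """Expected frequencies m_ij = (row total x column total) / n."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(observed, expected):
    """X2 = sum of (x - m)^2 / m over all cells."""
    return sum((x - m) ** 2 / m
               for obs_row, exp_row in zip(observed, expected)
               for x, m in zip(obs_row, exp_row))


def likelihood_ratio_chi_square(observed, expected):
    """G2 = 2 * sum of x * (ln x - ln m); empty cells contribute nothing."""
    return 2 * sum(x * (math.log(x) - math.log(m))
                   for obs_row, exp_row in zip(observed, expected)
                   for x, m in zip(obs_row, exp_row)
                   if x > 0)


# Hypothetical 2 x 2 table of observed frequencies (n = 100).
observed = [[30, 20], [20, 30]]
expected = expected_under_independence(observed)
x2 = pearson_chi_square(observed, expected)
g2 = likelihood_ratio_chi_square(observed, expected)
print(round(x2, 3), round(g2, 3))
```

With these made-up counts the two values differ only in the second decimal place, which is the kind of agreement the text describes for samples of about 100 or more.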


ADVANTAGES OF THE LIKELIHOOD-RATIO CHI-SQUARE

The biggest advantage of the likelihood ratio statistic is that it is additive. That means that G2 can be partitioned and portions of the statistic allocated to different pieces of a particular model.

However, the G2  statistic can only be partitioned for nested models. A model can be said to be nested if the more complex model contains ALL the terms of a simpler, lower order model.

For example, the hierarchical model (A*B*C) contains the three way interaction, the three two way associations, and the three marginal terms (as well as the case base, which we can assume is a typical feature in virtually all loglinear models).

Thus the hierarchical model: (A*B)(A*C)(B*C) is nested in the more complex model (A*B*C) because the model (A*B*C) contains every term that is in the model (A*B)(A*C)(B*C) as well as the additional three variable interaction effect A*B*C.

We can subtract the G2 for the more complex model from the G2 for the simpler model. We can also subtract the corresponding degrees of freedom. The resulting difference ALSO has a Chi-square distribution, with df equal to the difference in degrees of freedom between the two models.

The chi-square statistic for simpler nested models is virtually always larger than the chi-square statistic for more complex models. This is because the more complex model has more parameters, and hence the expected and observed cell frequencies are a better match and the chi-square statistic is smaller than it is for simpler models.

Similarly, the more complex model has fewer degrees of freedom than the simpler model. Because the more complex model has to fit more marginals, associations, and interaction effects, it "uses up" more degrees of freedom than the simpler model, which has fewer parameters to estimate.

Using the three way table for the 1999, 2001 and 2006 NSF Surveys combined using the planetary question, gender and education, I test models in the section below. But to anticipate briefly:

A hierarchical model that includes the three variable interaction has a G2 of 0 and 0 degrees of freedom.

MODEL A: A hierarchical model that omits the three variable interaction has 1 degree of freedom (because this is a special case where all the variables have only two categories or values) and a G2 of 1.03 (p = 0.31).

MODEL B: A hierarchical model that omits the three variable interaction and also the two way association between gender and degree level has 2 degrees of freedom (we gain back another df by omitting the gender by education association--each variable again has only two values) and a G2 of 2.02 (p = 0.36). The  difference between Model B - Model A is a likelihood ratio statistic G2 of 2.02 - 1.03 = 0.99 with 2 - 1 or 1 df.

In my example, the difference between the two models fell far short of statistical significance (as does each of the models themselves) but that is not always the case. Partitioning the G2 may enable us to see exactly where the important parameters of a complex model lie.
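The partition for Models A and B can be reproduced with a short Python sketch. The G2 values and degrees of freedom are the ones reported above; the helper chi_square_p_value is a hypothetical name, written with the standard library only and valid for positive whole-number df.

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail Chi-square probability for a positive whole-number df."""
    half = statistic / 2.0
    if df % 2 == 1:
        a = 0.5
        p = math.erfc(math.sqrt(half))   # exact result for df = 1
    else:
        a = 1.0
        p = math.exp(-half)              # exact result for df = 2
    # Step up two df at a time: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        p += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1.0
    return p


# Values reported in the text for the two nested models.
g2_a, df_a = 1.03, 1   # Model A: omits the three variable interaction
g2_b, df_b = 2.02, 2   # Model B: also omits the gender by degree level association

g2_diff = g2_b - g2_a
df_diff = df_b - df_a
p_diff = chi_square_p_value(g2_diff, df_diff)

print(round(chi_square_p_value(g2_a, df_a), 2))
print(round(chi_square_p_value(g2_b, df_b), 2))
print(round(g2_diff, 2), df_diff, round(p_diff, 2))
```

The first two lines recover the probability levels quoted for each model, and the last line tests the one term that separates Model B from Model A.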

You can ONLY use the G2 partitioning for nested models. You cannot use it to compare two models that are not nested. For example, you could not subtract the G2s and their associated degrees of freedom for the following two models:

{AB}          versus          {BC}

The {AB} model which contains AB, A and B and
the {BC} model which contains BC, B and C

{AB} and {BC} are not subsets of each other. The first model contains the A marginal which is nowhere to be found in the second model and the second model contains the C marginal which is nowhere to be found in the first model.
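The nesting rule can be checked mechanically: expand each hierarchical model into its full list of terms and ask whether one list is a subset of the other. The sketch below assumes single-letter variable names; implied_terms and is_nested are hypothetical helper names.

```python
from itertools import combinations


def implied_terms(generators):
    """All terms contained in a hierarchical model, given its highest-order terms."""
    terms = set()
    for generator in generators:
        letters = sorted(generator)
        for size in range(1, len(letters) + 1):
            terms.update("".join(subset) for subset in combinations(letters, size))
    return terms


def is_nested(simpler, more_complex):
    """True if every term of the simpler model also appears in the more complex model."""
    return implied_terms(simpler) <= implied_terms(more_complex)


print(is_nested(["AB", "AC", "BC"], ["ABC"]))  # nested: G2 values may be subtracted
print(is_nested(["AB"], ["BC"]))               # not nested: A and AB are missing from {BC}
print(is_nested(["BC"], ["AB"]))               # not nested in the other direction either
```

Only when the check succeeds does subtracting the two G2 values and their degrees of freedom make sense.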
 
 

A NOTE ON ALPHA LEVELS

Whether we are evaluating an overall G2 for a model or assessing the G2 difference for two nested models, we tend to enlarge the alpha level. Partly this is to depress levels of type 2 error, which many computer programs do not calculate. For the same size n, an increase in the type 1 error level tends to decrease the probability of a type 2 error (although note that this is not an exact inverse relationship).

In addition, our preference is to go with simpler models that contain fewer parameters to describe the data when this is possible.

Thus, in loglinear analysis, to assess a model overall, or to test the differences across nested models, we tend to adopt a 0.20 type 1, α, or probability level MINIMUM. This implies that we will not add additional parameters to describe the model unless it is absolutely necessary.

BASIC MODEL TESTING

A loglinear model is a set of created parameters that generates a [multivariate] table of expected cross tabulation frequencies. As we saw in the simplest association case, the two by two table, in most tables, several models are possible on the same table. However, not all models will fit the observed data accurately, that is, within sampling error.

During model testing, we compare the generated [modeled, expected] cell frequencies with the observed frequencies. If the two sets overall are within sampling error of one another, as they were with each of my Models A and B above, "the model fits". If the deviations between the two are beyond sampling error, the model is a "poor fit". When the fit is poor, we usually add back parameters that generate new expected frequencies that are closer on the average to the real frequencies and so create a "better fit".

We test the fit of a particular model with a likelihood ratio Chi-square statistic. Large G2s mean large deviations between the modeled and observed data, and this means that the model "doesn't fit". Parameters must be added to the model equation (see below) so that the modeled and observed frequencies become more similar to one another. The most complex model, the fully saturated model, generates expected frequencies that exactly match the observed frequencies. Thus, the fully saturated model always "fits perfectly" and its G2 is 0. Most of the time, however, the saturated model is not considered "very interesting."

The parameters in the equations for a loglinear model specify the marginal splits, associations and interactions on univariate, bivariate and multivariate cross tab tables. For example, one important set of parameters creates the "independence model". In the independence model for a two way table, you select parameters such that the total case base and both univariate marginal distributions exactly match the observed data. That is, you allow the univariate odds to depart from 1 if that is the case in the real dataset. However, the second order odds (the odds-ratio) is set to 1 (ln odds = 0). This forces the relative frequency distribution (percents or proportions) on the second variable to be the same across each category of the first variable and to match the univariate marginal (e.g., we would set about 25 percent of both men and women to give the wrong answer on the planetary question if this were the total sample percent as shown in Guide 3). You then compare the expected frequencies generated under the independence model with the observed frequencies. With a large X2, you reject the independence model: the parameter that specifies a relationship between the two variables (i.e., makes the modelled cells match the observed cells) has to be returned to the model.
 


 We follow a similar pattern of model testing with more complex tables. It is a good idea to first write down the fully saturated model (e.g., {ABC} in the three variable case below) so that you know what all the model parameters are. That way you will have an idea of which effects to eliminate first. I am assuming a hierarchical model unless I mention otherwise.

Then, OMIT the effect from the model that you wish to test and observe the G2 statistic. If the G2  is very large relative to the degrees of freedom, the specified model does not fit. The omitted parameter must be returned to the model to make the observed and expected cell frequencies match within sampling error.

On the other hand, if the simpler model fits, see what additional parameters can also be dropped. You can assess the new model both overall, and also compare it via  partition to a more complex model in which the new model is nested.

In the planetary question by gender by degree level table, we begin with 8 cells and 8 df. In the fully saturated model we have three marginals (each subtracts 2 - 1 = 1 df, for a total of 3 df), three two way effects (also "eating" 3 df), a possible three way interaction effect (1 df) and the case base (the last df). In the table below, I converted the cell frequencies to column percentages to make comparing the gender and educational groups a little easier.

(NOTE: I recommend this when you begin working with tables; it helps you interpret your results and may suggest models that are simpler than the saturated model.)
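The degrees-of-freedom bookkeeping can be sketched in Python for any hierarchical model: start with the number of cells, subtract 1 for the case base, and subtract the product of (categories - 1) for every term the model contains. Here A, B and C are generic labels for the three two-category variables, and the function names are hypothetical.

```python
from itertools import combinations
from math import prod


def implied_terms(generators):
    """All terms contained in a hierarchical model, given its highest-order terms."""
    terms = set()
    for generator in generators:
        letters = sorted(generator)
        for size in range(1, len(letters) + 1):
            terms.update("".join(subset) for subset in combinations(letters, size))
    return terms


def residual_df(generators, levels):
    """Degrees of freedom left over after fitting a hierarchical model."""
    cells = prod(levels.values())
    used = 1  # the case base n
    for term in implied_terms(generators):
        # A term uses (categories - 1) df for each variable it involves, multiplied together.
        used += prod(levels[variable] - 1 for variable in term)
    return cells - used


levels = {"A": 2, "B": 2, "C": 2}   # three variables with two categories each

print(residual_df(["ABC"], levels))              # fully saturated model
print(residual_df(["AB", "AC", "BC"], levels))   # no three way interaction
print(residual_df(["AC", "BC"], levels))         # also drops the AB association
```

Changing the entries in levels shows how quickly the counts grow once variables have more than two categories.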

PLANETARY QUESTION BY GENDER BY EDUCATION 1999-2006

EDUCATIONAL LEVEL    SOME COLLEGE OR LESS           BA OR MORE
GENDER               MALE     FEMALE   ROW n        MALE     FEMALE   ROW n
EARTH AROUND SUN     77.8%    62.1%    2785         94.1%    85.6%    1118
EVERYTHING ELSE      22.2     37.9     1239         5.9      14.4     125
TOTAL                100.0%   100.0%                100.0%   100.0%
Ns                   1819     2205     4024         630      613      1243

When we examine the percentages in this three way table, we see that first, regardless of gender, better educated individuals more often get the question right. Second, within each level of education, men more often give the correct answer than women do. These are "joint" effects: both independent variables influence the response variable, the planetary question.

Third, there MAY be an interaction among gender, education, and the planetary question. Gender seems to make more of a difference in obtaining the right answer among the less educated respondents (77.8% - 62.1% = 15.7%) than it does among the better educated (94.1% - 85.6% = 8.5%). However, this apparent interaction effect could simply be sampling error. A total sample of 5267 is a nice size, to be sure, but notice that each of the separate educational subtables is smaller than the total n of 5267. (This n differs from the n = 5269 reported earlier because of how the sample weights are applied, which causes small variations in the case base from analysis to analysis.)

In my computer analysis, I specified a model which eliminates the three way interaction term (A*B*C) and examined the G2 in light of the 1 degree of freedom associated with that interaction. The G2  for this model (which contains all two way associations, all marginals and the case base) was 1.03 with 1 df and a significance level of .31. This means that if the parameter that generates the three way interaction term is really zero, we would expect a G2  this big (or larger) about one-third of the time. This analytic result certainly seems within sampling error and it means that we can drop the three-way interaction among gender, education and the planetary question and the model will still fit the observed data reasonably well.

Can we simplify this three variable model further still? The next candidate I examined was the gender-degree level association. There doesn't seem to be much of a sex difference on the degree level variable. So, let's test a new model that omits both the three way interaction and the gender by degree level association and see what happens.

The model that incorporates all three marginal terms, the case base, the association between gender and the planetary question, and the association between degree level and the planetary question has two degrees of freedom (we added back 1 each for the three-way interaction term and the gender-degree level association) and a G2 of 2.02. The associated probability level is 0.36. This model also fits the data well.

On the other hand, models that drop either the gender-planetary question association (G2(2) = 142.02, p < .0001) or the degree level-planetary question association (G2 (2)= 234.51, p < .0001) clearly do not fit. Both these terms must remain in the model ("be fixed") in order to have expected frequencies that are reasonably close to the observed frequencies.

Thus, the simplest model for our three way tables says that both having a college degree and being male are associated with a greater likelihood of giving the correct answer on the planetary science question. Sometimes we call this a "joint" relationship because we have two independent variables (gender and degree level), not just one to explain answers to the science question. On the other hand, women are about as likely as men to have a college degree. In terms of the loglinear equations below, where A = gender, B = degree level and C = planetary question response, we could describe the model this way:

$$G_{ijk} = \theta + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ik}^{AC} + \lambda_{jk}^{BC}$$

In a further guide, we will discuss the lambda numeric parameters in the equation. Note here that these are additive parameters because the λs are logged from the original multiplicative equations.

Using abbreviated model notation, either set of terms below would also describe this model:

{AC} {BC}
(A*C)(B*C)
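For the {AC}{BC} model the expected frequencies have a simple closed form, because A and B are treated as unrelated within each category of C: each expected cell is the AC margin times the BC margin divided by the C margin. The sketch below uses a hypothetical 2 x 2 x 2 table (not the NSF data) indexed as observed[i][j][k]; the function names are illustrative, and a statistical package would normally do this fitting for you.

```python
import math


def fit_ac_bc(observed):
    """Expected frequencies for {AC}{BC}: m_ijk = x_i+k * x_+jk / x_++k."""
    n_a, n_b, n_c = len(observed), len(observed[0]), len(observed[0][0])
    x_ac = [[sum(observed[i][j][k] for j in range(n_b)) for k in range(n_c)]
            for i in range(n_a)]
    x_bc = [[sum(observed[i][j][k] for i in range(n_a)) for k in range(n_c)]
            for j in range(n_b)]
    x_c = [sum(x_ac[i][k] for i in range(n_a)) for k in range(n_c)]
    return [[[x_ac[i][k] * x_bc[j][k] / x_c[k] for k in range(n_c)]
             for j in range(n_b)]
            for i in range(n_a)]


def likelihood_ratio_chi_square(observed, expected):
    """G2 = 2 * sum of x * (ln x - ln m) over all cells with x > 0."""
    total = 0.0
    for obs_i, exp_i in zip(observed, expected):
        for obs_j, exp_j in zip(obs_i, exp_i):
            for x, m in zip(obs_j, exp_j):
                if x > 0:
                    total += x * (math.log(x) - math.log(m))
    return 2 * total


# Hypothetical counts: observed[gender][degree level][answer].
observed = [[[40, 10], [25, 5]],
            [[30, 20], [22, 8]]]
expected = fit_ac_bc(observed)
g2 = likelihood_ratio_chi_square(observed, expected)
print(round(g2, 3), "on 2 df")
```

The fitted table reproduces the observed AC and BC margins exactly, so any remaining G2 reflects only the omitted AB and ABC terms.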
 


BASIC LOGLINEAR EQUATIONS

The loglinear model we have been working with up until now is often called the General Cell Frequency Model (GCF).

In the GCF model, we are trying to predict or model a cell frequency. Cell frequencies can be created by marginal splits (in a 1 by k table), two variable cross-tabulations, cross tabulating two variables within categories of a third and so forth.

Of the models we consider this semester, the GCF model allows  the most flexibility. We readily observe all associations, including those among independent variables. We can test "path-like" causal models (I will draw some in class), check for indirect causal effects and statistical interactions (specifications) more readily in GCF models than in other kinds of models, e.g., logit models or logistic regression.

Predictors that have an effect (are statistically significant) in GCF models raise or lower the predicted cell frequencies (Fij in the original multiplicative model and ln Fij = "Gij" in the logged additive and linear model) in a multivariate crosstabs table. Negative (logged) Gij parameters mean fewer cases in a cell than would occur with a predicted equiprobable or no effect model. Positive (logged) Gij parameters mean more cases in a cell than an equiprobable or no effect model would predict.
 


The original formula that produces the cell frequency is multiplicative (think back to the probabilities divided by the case base in Guide 3).

A   X B  X C   X  n

for a three variable A by B by C table.

More formally, we write out the loglinear equation as a set of parameters using eta for the "grand mean" (the equiprobable model) and taus for the variables and combinations of variables. For example, for three variables, A, B and C, the formula for the fully saturated loglinear model is given below. This is a multiplicative model that predicts the literal cell frequencies (not in logged form)  Fijk.

EQUATION 1:      $F_{ijk} = \eta \, \tau_i^A \, \tau_j^B \, \tau_k^C \, \tau_{ij}^{AB} \, \tau_{ik}^{AC} \, \tau_{jk}^{BC} \, \tau_{ijk}^{ABC}$

recall that:          ln (A * B)  = ln A + ln B

and:                     ln (A/B)  = ln A - ln B

Multiplicative and nonlinear coefficients are generally more difficult to interpret than linear and additive parameter coefficients. By taking natural logarithms of both sides of Equation 1, we can create an additive and linear model equation in Equation 2, hence the term "loglinear". In the transformed model equation for the parameters, the ln Fij are now called "Gij" and the new parameters are lambdas rather than taus. So Equation 1 now becomes:

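EQUATION 2:      $G_{ijk} = \theta + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$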
 

The theta parameter (eta in the multiplicative model) is for the "grand mean" or "fixing n" in the equiprobable or simplest model.
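The move from the multiplicative form to the additive form can be illustrated on a two variable table, the smaller cousin of Equations 1 and 2. The sketch below assumes effect coding (each set of lambdas sums to zero, so theta is the mean of the logged cell frequencies); that coding and the cell counts are assumptions made for illustration, not something fixed by the equations above.

```python
import math

# Hypothetical 2 x 2 table of cell frequencies F_ij.
F = [[40.0, 10.0], [20.0, 30.0]]

# Additive form: G_ij = ln F_ij = theta + lambda_A[i] + lambda_B[j] + lambda_AB[i][j]
G = [[math.log(f) for f in row] for row in F]
theta = sum(sum(row) for row in G) / 4                       # "grand mean" of the logged cells
lambda_a = [sum(G[i]) / 2 - theta for i in range(2)]         # row effects
lambda_b = [(G[0][j] + G[1][j]) / 2 - theta for j in range(2)]  # column effects
lambda_ab = [[G[i][j] - theta - lambda_a[i] - lambda_b[j] for j in range(2)]
             for i in range(2)]                              # association term

# Multiplicative form: each tau is the antilog of the matching lambda, eta of theta.
eta = math.exp(theta)
tau_a = [math.exp(v) for v in lambda_a]
tau_b = [math.exp(v) for v in lambda_b]
tau_ab = [[math.exp(v) for v in row] for row in lambda_ab]

rebuilt = [[eta * tau_a[i] * tau_b[j] * tau_ab[i][j] for j in range(2)]
           for i in range(2)]
print([[round(value, 6) for value in row] for row in rebuilt])
```

Multiplying the taus back together recovers the original cell frequencies, which is all that the saturated model claims; the logged version simply adds where the original multiplies.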

In later guides, we will see how this equation can be transformed in the cases of logits or logistic regression.
But make no mistake about it: what you see above in the loglinear basic equation is THE basic equation.

It is the foundation for logistic regression but contains much more. It is what allows you to simultaneously explore the relationship among possible independent variables (in the two variable associations) as well as possible indirect causal effects on a postulated dependent variable.

Equations 1 and 2 addressed the fully saturated model. However, our goal is to have the simplest possible model that fits the data with the fewest number of parameters. For example, if the independent variables are uncorrelated, these terms can be dropped from the model (e.g., $\lambda_{ij}^{AB}$). If a third order interaction (e.g., $\lambda_{ijk}^{ABC}$) is unnecessary, it can be dropped from the equation as well, as I did above in testing the gender-educational level-planetary question model.

For example, the hierarchical model {AB}{AC} (which also contains the {A}{B} and {C} parameters) is written this way:

EXAMPLE MODEL:      $G_{ijk} = \theta + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC}$
 
 
Susan Carol Losh
February 3 2009