Here is the link to the web pages on three variable crosstabulation tables and causal issues:

http://mailer.fsu.edu/~slosh/IntroStatsGuide6.html

Here is the link to the web pages on online data archives:

http://mailer.fsu.edu/~slosh/MethodsGuide8.html
 
READINGS GUIDE 1: ISSUES IN MODELING
GUIDE 2: TERMINOLOGY
GUIDE 3: THE LOWLY 2 X 2 TABLE
GUIDE 4: BASICS ON FITTING MODELS
GUIDE 5: SOME REVIEW, EXTENSIONS, LOGITS
GUIDE 6: LOGLINEAR & LOGIT MODELS
GUIDE 7: LOG-ODDS AND MEASURES OF FIT
GUIDE 8: LOGITS,LAMBDAS & OTHER GENERAL THOUGHTS
OVERVIEW

AN AGRESTI NOTE:  Agresti places relatively more importance on various logistic regression models, whereas I place relatively more importance on loglinear models. I wanted to mention a few important reasons for the discrepancy, for my preference, and the order of topics and readings:

Loglinear models are the BASIC models and logistic regression and logit models are derived from and tested by the underlying and corresponding loglinear model.
Other models are also derived from the underlying loglinear model, such as ordinal regression.
If you are comfortable with loglinear models, all the other derived models will be much easier for you (based on my teaching experience of this material).
Causal models can be tested relatively easily with loglinear models but virtually all logistic regression models are essentially only two causal stage models similar to an OLS regression model. I'm often interested in causal models so this is the way to go.
 
 

EDF 6937-01       SPRING 2009
THE MULTIVARIATE ANALYSIS OF CATEGORICAL DATA
GUIDE 2: TABLE NOMENCLATURE
Susan Carol Losh
Department of Educational Psychology and Learning Systems
Florida State University

 
COMPONENTS OF A TABLE
BASIC BIVARIATE TABLE REVIEW
MULTIVARIATE TABLES
ODDS AND ODDS-RATIOS
MAXIMUM LIKELIHOOD ESTIMATORS

 A TABLE is a common and useful way to present data.

Don't be a snob about tables! A table is the most useful basic building block in your tool chest of data analytic techniques and presentation of your results. If you can construct a simple table thoroughly, everyone (including you) will be able to assess your basic results. You could even write an entire dissertation using tables alone. And, of course, tables form the bedrock foundation of this course material.

Further, as we will shortly see, tables can become increasingly complex. You can (and we will) present joint distributions of three or more variables in tables.



COMPONENTS OF A TABLE

 
LET'S KEEP OUR TERMINOLOGY STRAIGHT
(no matter how basic it seems right now)

 
These are rows.
A row stretches from the left hand side of the table to the right hand side of the table. 
By convention, the top row is number 1.
 

 
Below are columns. 
Columns start at the top of the table and plummet straight down to the bottom of the table.
By convention, the FAR LEFT column is designated number 1.

 
A univariate table addresses one variable at a time. Sometimes univariate distributions are called "the marginals" (since they form the margins of a two-way or N-way table) or "marginal distributions."

A bivariate table addresses the simultaneous joint distribution of two variables. For example, the following combinations of values jointly and simultaneously cross-classify each individual on two characteristics, their college and their gender:

Male Business Major
Female Business Major
Male Education Major
Female Education Major
Male Humanities Major
Female Humanities Major

and so on.

The juxtaposition of a particular row in the table with a particular column produces a "cell" in the table.

By convention, we give the row first and the column second to locate a particular CELL in a bivariate table. The "female, business major" cell is row 1, column 2 or just "1,2" for short.

Staying with the example of gender and college major, we could present a bivariate table  as follows:

DISTRIBUTION OF COLLEGE MAJOR BY GENDER AT FLORIDA STATE UNIVERSITY 2004
 

                                          Gender
College Major                             Male      Female    Row Totals
Business                                  A CELL    1,2
Education                                 2,1       2,2
Humanities                                3,1       3,2
Other majors, entered row by row
Column Totals (for males and females)

The table itself is a rectangular array in at least two-dimensional space. It typically shows the value or category of the variable and the number of cases that fall into that category. It may also give the percentage of cases that fall into each category (but usually not both frequencies and percents).

There are a lot of ways to construct univariate and bivariate tables. This Guide follows commonly accepted statistical practices.


 
A UNIVARIATE EXAMPLE

Let's start by examining one variable at a time. Consider the following univariate frequency distribution AND percentage table from the February 2004 Current Population Survey (supplement), which is regularly conducted by the US Government:

TITLE: Percentage of United States Households with a Particular Type of Telephone (est) 2004
 

Type of Telephone Usage    Number of Cases    Percent of Total Cases
    Landline only               12,128              37.9%
    Cell phone only              1,824               5.7
    Both                        16,512              51.6
    No telephone                 1,536               4.8
Total households                32,000             100.0%

Source = February 2004 Current Population Survey. David Morganstein, Clyde Tucker, J. Michael Brick, James Esposito, and Brian J. Meekins (2004) "Household Telephone Service and Usage Patterns in the United States in 2004: A Demographic Profile." Online at: http://www.bls.gov/osmr/abstract/st/st040130.htm


 THERE ARE (AT LEAST) SIX COMPONENTS OR PIECES OF INFORMATION THAT "THE PERFECT TABLE" MUST TELL YOU. These components are:

1. THE TITLE

Each table must have a title that briefly but accurately describes the contents of the table. This means that if, for example, you have a bivariate distribution, you should include the names of BOTH VARIABLES in the title, such as "Type of Telephone Use by Household Income".

2. THE VARIABLE(S) OF INTEREST

In my univariate example, there is ONE variable of interest, what kind of telephone (if any) there is in the household.

3. THE CATEGORIES OF THE VARIABLE

In my example above, there are four categories: Landline only, Cell phone only, Both, and No telephone (which may also include the minuscule categories of refused and don't know). The percentage of households with no telephone, by the way, has been roughly constant for over 10 years.

4A. THE NUMBER OF CASES OR THE FREQUENCY IN EACH CATEGORY OF THE VARIABLE

The total collection of every category name with its associated frequency is the univariate frequency distribution.

4B. THE PERCENTAGE OF CASES IN EACH CATEGORY OF THE VARIABLE

Percentages are very handy, because they are a standardized measure, or on a "per 100" standard. The total collection of every category name with its associated percentage is the univariate percentage distribution.

This section reads 4A or 4B because typically either the frequency distribution OR the percentage distribution is presented, but not both. Presenting too many numbers clutters the table and makes it more difficult to read.

5. THE TOTAL CASE BASE

In my example, that's (about) 32,000 households. If you have the total case base and the percentages, you can recalculate the category frequencies (which, in fact, I had to do in this example).
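As a rough sketch of that recalculation, the snippet below rebuilds the category frequencies from the published percentages and the total case base in the telephone table. The variable and function names are my own, chosen only for illustration.

```python
# Published percentages and total case base from the telephone table above.
percents = {
    "Landline only": 37.9,
    "Cell phone only": 5.7,
    "Both": 51.6,
    "No telephone": 4.8,
}
total_households = 32000


def frequencies_from_percents(percents, total):
    """Convert a percentage distribution back into whole-number frequencies."""
    return {category: round(pct / 100 * total) for category, pct in percents.items()}


frequencies = frequencies_from_percents(percents, total_households)
for category, n in frequencies.items():
    print(f"{category:<16} {n:>7,}")
```

Because the published percentages are rounded to one decimal place, frequencies rebuilt this way are estimates. That they sum exactly to the case base here is not guaranteed in general.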

You would be amazed how many people omit the total case base, including authors in professional journals. In the back of my mind, I always get just a little suspicious.

What are they hiding? Were they ashamed of the number of cases? ("Three out of every four dentists recommend Blanca toothpaste" isn't very impressive when there are only four dentists.)

Did they manage to forget how many cases they collected?

6. THE SOURCE OF THE DATA

Who collected these data? The United States government? A freshman undergraduate for a psychology project? Your Aunt Millie?

It is important to know the source of the data because clearly some sources have more of a reputation for collecting data in a trustworthy, systematic and reliable way than others.

The United States government, the National Opinion Research Center (NORC), federal agencies of many countries around the world, certain private companies (e.g., the Roper Company), all have excellent reputations for the care that they take in data collection. You will want to know the collector of the data so that you can interpret the data in context.

BIVARIATE DISTRIBUTIONS

A bivariate distribution simultaneously and jointly cross-classifies the scores on a case for two variables.

For example, if we have a bivariate distribution of gender and support for President Obama (favorable/unfavorable) we can simultaneously cross-classify people as favorable males, favorable females, unfavorable males and unfavorable females.

The jointly cross-classified cases form the "cells" or interior of the table. Each cell has a frequency of cases that have a JOINT score considering both variables simultaneously.

The univariate summaries for each variable separately (for example, male or female) are at the bottom of the table for the independent variable and at the far right of the table for the dependent variable. Because these row and column totals sit in the margins of the table, they are often called "the marginals". Remember that the cells are labelled with the row number first, then the column number.

The grand total is usually presented in the lower right corner of the bivariate table.

Title: Generic Bivariate Table
                       Variable X, Value 1    Variable X, Value 2    Row Totals
Variable Y, Value 1    (Cell 1,1)             (Cell 1,2)             Marginal Total Variable Y, Value 1
Variable Y, Value 2    (Cell 2,1)             (Cell 2,2)             Marginal Total Variable Y, Value 2
Column Totals          Marginal Total         Marginal Total         Grand Total
                       Variable X, Value 1    Variable X, Value 2

Then, with values supplied for each variable:

Title: Attitude toward President Obama by Gender
               MALE                           FEMALE
FAVORABLE      Male-Favorable (Cell 1,1)      Female-Favorable (Cell 1,2)      Total Favorable
UNFAVORABLE    Male-Unfavorable (Cell 2,1)    Female-Unfavorable (Cell 2,2)    Total Unfavorable
               Total Male                     Total Female                     Grand Total

These terms: cell, grand total, and marginal total are important because each of them, as well as combinations of table cells, can become a parameter to be modelled in loglinear analysis.
 
THE "SIZE" OF A CROSSTABULATION TABLE

The size of a crosstabulation table (which is the total number of cells) depends on how many rows and columns are in the table.

In turn, the number of rows or columns depends on how many values or categories each variable has.

If the row variable has 3 categories and the column variable has 4 categories, the result is a "3 by 4" table.

CONVENTION: The row number always comes first.

Square tables have the same number of rows and columns (e.g., a 2 X 2 table such as the example above).

The size of the table and the number of marginal categories for each variable become critical in loglinear models in setting the degrees of freedom. The total case base becomes important in statistical significance testing.
 


 BIVARIATE PERCENTAGE TABLES: FAST REVIEW

The Bivariate Percentage Table is a variation on our old friend, the univariate percentage table. However, the bivariate percentages allow us to compare and contrast group similarities and differences. I have the very simplest bivariate table, which is a 2 X 2 table, below. There is one column each for women and men, one row for the correct answer and one row for the incorrect answer. The first table shown (also found in Guide 2) is the Bivariate Frequency Distribution:
 

 How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?
NOTE: By convention, categories of the independent variable typically form the COLUMNS of the table (see below).

Answer to Question:               Male            Female    Total
Sun goes around Earth (WRONG)     146 (r1, c1)    305        451
Earth goes around Sun (RIGHT)     652             715       1367
Total                             798             1020      1818

(At the bottom of each column are SEPARATE totals for women and men, then a total for everyone combined.)

Source: NSF Surveys of Public Understanding of Science and Technology, 2006, Director, General Social Survey (NORC).
n = 1818
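The marginals and grand total in the frequency table above can be recovered from the four interior cells alone. The following is a minimal sketch in Python; the dictionary layout and names are illustrative assumptions, not part of the course materials.

```python
# Cell frequencies from the table above, keyed by (row, column):
# the row is the answer given, the column is the respondent's gender.
cells = {
    ("wrong", "Male"): 146, ("wrong", "Female"): 305,
    ("right", "Male"): 652, ("right", "Female"): 715,
}

row_totals = {}     # marginals for the dependent variable (far right of the table)
column_totals = {}  # marginals for the independent variable (bottom of the table)
for (answer, gender), n in cells.items():
    row_totals[answer] = row_totals.get(answer, 0) + n
    column_totals[gender] = column_totals.get(gender, 0) + n

grand_total = sum(cells.values())

print(row_totals)     # {'wrong': 451, 'right': 1367}
print(column_totals)  # {'Male': 798, 'Female': 1020}
print(grand_total)    # 1818
```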

A key issue is whether to percentize down the columns or across the rows.
Make no mistake about it, this IS a key issue and not a matter of semantics. Percentizing in "the wrong direction" will totally change the meaning of the results that you present.

CONVENTION: Values of the independent variable create the columns of the table.

For example, the two values of gender: male and female, head each column in my sample table.
Remember, gender might cause science knowledge, but we know science knowledge CANNOT cause biological sex. Therefore, gender is the independent variable. Science knowledge is the possible effect, or dependent variable. (Review Guide 1 on causal order in non-experimental data if you are not convinced.)
 
 
IMPORTANT NOTE: AGRESTI AND TERMINOLOGY
Sometimes you will notice that Agresti "flip-flops" on whether the categories of the independent (explanatory) variable or the dependent (response) variable form the columns of the table. Mostly he has the explanatory or independent variable as the row variable, rather than the column variable. However, because the norm (see most journal articles in the social and behavioral sciences) is to have the independent variable form the columns of the table, I will follow the convention of having the independent variable comprise the table columns throughout this course.

CONVENTION: Percentize separately within values of the independent variable.

In my example, this means that first I calculate the percent giving correct and incorrect responses for men. I then repeat the process, calculating the percent giving correct and incorrect responses for women.

Once I have done so, I can now specify the percentage of men who give the right answer (the Earth goes around the Sun) and the percentage of women who give the right answer, and then directly compare women and men.

These percentages within gender are different numbers, and they mean something entirely different from the following question:

among those who think the Earth goes around the Sun, what percent are female?
(Answer: 715/1367 X 100 or 52.3%. Since women are 1020/1818 X 100 or 56.1 percent of the sample, we can see that women are underrepresented among those giving the correct answer. Notice below that neither column has a percent figure of 52.3%.)

CONVENTION: Remember that when the columns are formed by categories of  the independent variable, a percent sign ONLY goes at the top of each column (in this case, the "wrong answer") and after the 100 percent at the bottom of each column. (Note: this is because the values of the independent variable form the columns of the table.)

These conventions are particularly important as the number of values for each variable grows and as the number of variables grows. They help your reader to immediately discern which way the percentages are calculated and they make your table much easier to read.
 

 How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?
Answer to Question:               Male      Female
Sun goes around Earth (WRONG)     18.3%     29.9%
Earth goes around Sun (RIGHT)     81.7      70.1
Total                            100.0%    100.0%
Casebase                           798      1020

Source: NSF Surveys of Public Understanding of Science and Technology, 2006, Director, General Social Survey (NORC).
n = 1818 (P.S. Yes, these are real data and real percentages. Sigh.)
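To make the direction of percentizing concrete, here is a small sketch that computes the percentages within each gender (down the columns) and then, for contrast, the "other direction" figure discussed earlier. The data layout and function name are my own assumptions for illustration.

```python
# Frequencies from the gender by planetary-question table, one entry per column.
counts = {
    "Male": {"wrong": 146, "right": 652},
    "Female": {"wrong": 305, "right": 715},
}


def column_percents(counts):
    """Percentize separately within each value of the independent variable."""
    result = {}
    for group, answers in counts.items():
        casebase = sum(answers.values())
        result[group] = {a: round(100 * n / casebase, 1) for a, n in answers.items()}
    return result


within_gender = column_percents(counts)
print(within_gender["Male"])    # {'wrong': 18.3, 'right': 81.7}
print(within_gender["Female"])  # {'wrong': 29.9, 'right': 70.1}

# Percentizing across the row instead answers a different question:
# among those giving the right answer, what percent are female?
right_total = counts["Male"]["right"] + counts["Female"]["right"]
pct_female_among_right = round(100 * counts["Female"]["right"] / right_total, 1)
print(pct_female_among_right)   # 52.3
```

The same four cell counts produce 70.1 percent in one direction and 52.3 percent in the other, which is why the direction has to be chosen deliberately.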
 

MULTIVARIATE DISTRIBUTIONS

A multivariate table jointly and simultaneously cross-classifies each individual on at least three characteristics. For example, a Hispanic Female Business Major is simultaneously cross-classified on her ethnicity, her gender, and her college.

For example, we might have a multivariate distribution on gender (male/female), marital status (married/not), and presence of children (yes/no). We can simultaneously cross-classify people into one of the resulting combinations, such as married women with children or unmarried men without children.

The size of the multivariate table is the number of rows multiplied by the number of columns (in the "original" table) THEN multiplied by the number of categories or values in the third or "control variable."

Using my example above, with two marital status categories (married versus unmarried), two sexes (female and male), and two parental statuses (with children versus without), this is a 2 X 2 X 2 or 8-cell total table.
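One way to see where the eight cells come from is to enumerate every combination of categories. This is only a sketch; the category labels are paraphrased from the example and the variable names are hypothetical.

```python
from itertools import product
from math import prod

# Categories for each of the three variables in the example.
variables = {
    "gender": ["female", "male"],
    "marital status": ["married", "not married"],
    "children": ["has children", "no children"],
}

# Every cell is one combination of a category from each variable.
cells = list(product(*variables.values()))

# Table size is the product of the number of categories per variable.
n_cells = prod(len(categories) for categories in variables.values())

for cell in cells:
    print(cell)
print(n_cells)  # 8
```

Adding a fourth variable with, say, three categories would multiply the cell count by three, which is one reason large tables thin out the cases per cell so quickly.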

We could set up our example this way:

                Women                                          Men
                Married       Not Married                      Married        Not Married
Has Children    MarFeKid      UnMarFeKid      Has Children     MarMenKid      UnMarMenKid
No Children     MarFeNoKid    UnMarFeNoKid    No Children      MarMenNoKid    UnMarMenNoKid

Notice that we have now created TWO separate tables side by side that examine how marital status relates to the presence of children in the home, one for men and one for women.

The use of separate or "partial" tables or "subtables" to examine the original relationship between two variables within categories of a third variable (e.g., looking at how marital status influences the presence or absence of children separately for each category of a control variable such as gender) is also called physical control because we have physically separated the cases into groups using the control variable. Generally physical control is an inefficient way to analyze cross-tabulation tables.


 

ODDS AND ODDS RATIOS

The odds of a variable is simply the ratio of one of the category frequencies in a variable to another category frequency in the same variable. Although we can take odds on any variable, in bivariate and multivariate analysis, if we are able to designate a dependent (response) variable, typically we take the odds on the categories of the dependent variable.

Typically, in a binary (two category) variable we designate one of the categories as a "success" and the other as a "failure". This has nothing to do with the social or emotive meaning of success and failure. For example, in examining death rates from a disease, we might easily designate death as a "success" and survival as a "failure".

Odds can vary from zero (no successes at all) to infinity. They are undefined when there are no "failures".
Odds are fractional when there are more failures than successes. For example, if most people with a disease survive, the odds will be fractional. Unlike the Linear Probability Model, the odds does not have a truncated variance.

Let's re-examine the table for gender and the "planets" question.
 
 

 How Gender Influences Answers to the Question: Does the Sun go around the Earth or does the Earth go around the Sun?
GENDER-->
Answer to Planetary Question:     Male    Female    Total
Sun goes around Earth (WRONG)     146     305        451
Earth goes around Sun (RIGHT)     652     715       1367
Total                             798     1020      1818

Source: NSF Surveys of Public Understanding of Science and Technology, 2006, Director, General Social Survey (NORC).
n = 1818

The odds on GENDER (using Male as a "success") for the total group is 798/1020 = .78
That is, someone is 78% as likely to be male as to be female.

The odds on the PLANETARY QUESTION (using the right answer as a "success") for the total group is 1367/451 or 3.03. Right answers occur about three times as often as wrong answers do.
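The two marginal odds above can be reproduced with a few lines of Python. The helper name `odds` is my own; it simply divides the "success" frequency by the "failure" frequency and refuses to divide when there are no failures.

```python
def odds(successes, failures):
    """Ratio of 'success' to 'failure' frequencies; undefined with no failures."""
    if failures == 0:
        raise ZeroDivisionError("odds are undefined when there are no failures")
    return successes / failures


# Marginal odds on gender, with Male as the "success".
odds_male = odds(798, 1020)
# Marginal odds on the planetary question, with the right answer as the "success".
odds_right = odds(1367, 451)

print(round(odds_male, 2))   # 0.78
print(round(odds_right, 2))  # 3.03
```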

We can calculate conditional odds within each category of the independent (explanatory) variable. For our tabular example above, the odds of giving the correct to the incorrect answer are:

For men:  652/146   or 4.47
For women: 715/305  or  2.34

Although both sexes are more likely to give right answers than wrong answers, men are over four times as likely to give a right answer as a wrong answer, whereas women are slightly more than twice as likely to give a right answer as a wrong answer.

We can then examine the second order odds or odds-ratio. This is comparing the odds across groups or categories of the independent variable. In this instance (remembering again that "successes" form the numerator) we would divide the odds for men on the planetary question by the odds for women on the planetary question or:

4.47/2.34 = 1.91

When the odds-ratio or second order odds is 1, we have the case of statistical independence, i.e., the independent variable has no influence on the dependent variable. The odds on the dependent variable are the same regardless of whether one is male or female.

In this example, the odds-ratio of 1.91 is suggestive. Men are a little under twice as likely to give the right answer compared with the wrong answer as women are. It is important in the interpretation of the odds that I give both the successes and failures. After all, a glance at the table should reassure you it would be incorrect to say men were twice as likely to give the correct answer as women. This would be untrue. But the odds for men is nearly twice that for women.
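The sketch below recomputes the conditional odds and the odds-ratio from the cell frequencies, and contrasts the odds-ratio with the simple ratio of the two percentages of right answers. Names are illustrative. Note that dividing the rounded odds (4.47/2.34) gives 1.91, while dividing the unrounded odds gives a value just under 1.91, so the last digit depends on when you round.

```python
men = {"right": 652, "wrong": 146}
women = {"right": 715, "wrong": 305}

# Conditional odds of a right to a wrong answer within each gender.
odds_men = men["right"] / men["wrong"]        # about 4.47
odds_women = women["right"] / women["wrong"]  # about 2.34

# Odds-ratio (second order odds): the odds for men divided by the odds for women.
odds_ratio = odds_men / odds_women

# The same quantity written as a cross-product of the four cells.
cross_product = (men["right"] * women["wrong"]) / (men["wrong"] * women["right"])

# For contrast: the ratio of the proportions answering correctly.
p_men = men["right"] / (men["right"] + men["wrong"])
p_women = women["right"] / (women["right"] + women["wrong"])
proportion_ratio = p_men / p_women

print(round(odds_men, 2), round(odds_women, 2))
print(round(odds_ratio, 3))
print(round(proportion_ratio, 2))  # 1.17
```

The proportion ratio of about 1.17 shows why it would be wrong to say men are twice as likely as women to answer correctly, even though their odds are nearly twice as large.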

I say this is suggestive because at this point we have not done any formal tests of statistical significance. But have patience because before you know it, we will.

What happens if your variable has more than two categories?
In this case, several possibilities open up. Some of them depend on whether the categories have an inherent order to them (ordinal data) or whether category names or values are simply labels or "tags" (nominal data).

One possibility is to examine the odds successively of a "success" in turn to all other categories combined (recall Jim Davis of Chicago, Harvard and Dartmouth said years ago we can always dichotomize the states of the USA into New Hampshire versus all other states combined). In the case of ordinal categories we may wish to calculate odds on adjacent categories. If we have K categories, we can form K-1 odds. We cannot do K odds because the Kth odds is dependent upon the first K-1 odds.
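As an illustration of why only K-1 odds carry information, the following sketch uses the four telephone categories from earlier in this Guide. It forms the odds of each of the first three categories against all others combined, then shows that the fourth odds can be recovered from those three. The function name and layout are assumptions made for the example.

```python
from math import isclose

# Frequencies from the univariate telephone table (K = 4 categories).
frequencies = {
    "Landline only": 12128,
    "Cell phone only": 1824,
    "Both": 16512,
    "No telephone": 1536,
}
total = sum(frequencies.values())


def odds_vs_rest(n, total):
    """Odds of one category against all other categories combined."""
    return n / (total - n)


categories = list(frequencies)

# Form K-1 odds, one for each category except the last.
first_odds = {c: odds_vs_rest(frequencies[c], total) for c in categories[:-1]}

# Each odds converts back to a proportion: p = odds / (1 + odds).
proportions = [o / (1 + o) for o in first_odds.values()]

# The last proportion, and so the last odds, is fixed by the others.
p_last = 1 - sum(proportions)
recovered_last_odds = p_last / (1 - p_last)

direct_last_odds = odds_vs_rest(frequencies[categories[-1]], total)
print(isclose(recovered_last_odds, direct_last_odds))  # True
```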

We call the natural (base e/Euler's number/2.71...) log of the odds the "log-odds" or logit; the natural log of the odds-ratio is, correspondingly, the log odds-ratio. I use the abbreviation "ln" to distinguish base e from base 10 or log 10 logarithms. Most statistical uses of logs use base e lns.
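A brief sketch, using the planetary-question frequencies, of taking natural logs of the conditional odds and of the odds-ratio. It also checks one convenient property: the log of the odds-ratio equals the difference between the two log-odds. Variable names are illustrative only.

```python
from math import isclose, log

# Conditional odds of a right to a wrong answer, from the table above.
odds_men = 652 / 146
odds_women = 715 / 305

# math.log with one argument is the natural (base e) logarithm, i.e. "ln".
ln_odds_men = log(odds_men)      # roughly 1.50
ln_odds_women = log(odds_women)  # roughly 0.85

# Log of the odds-ratio.
ln_odds_ratio = log(odds_men / odds_women)

# On the log scale, a ratio of odds becomes a difference of log-odds.
print(isclose(ln_odds_ratio, ln_odds_men - ln_odds_women))  # True

# Statistical independence (odds-ratio of 1) corresponds to a log of 0.
print(log(1.0))  # 0.0
```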

In later guides we will see how we can extend the odds ratio to any number of variables and how to calculate them. Get comfortable with the terminology for now.

MAXIMUM LIKELIHOOD ESTIMATORS

We are used to null hypotheses that say in essence: if the parameter we're interested in is zero, what is the probability or likelihood of observing the results we did by chance?

We are used to direct estimation methods of parameters that basically take a one step procedure, whether using linear (matrix) algebra or partial derivatives in calculus. That's what we do in OLS regression.

But causal model statistics are different, whether they are structural equation models or loglinear models. Most use Maximum Likelihood Estimators (MLEs) and iterative methods to solve.

With MLEs we ask the question: given the data, how likely are the parameters (sort of the reverse of sentence 1)? With MLEs we estimate a host of possible parameters, each with a probability attached to it (see the graphs in Agresti). The MLE solution you see on your computer output is the one with the highest overall probability attached to it, hence MAXIMUM likelihood estimators.

With large samples, MLE models often give results that are similar to those estimated through more direct methods, such as Ordinary Least Squares regression. But in many cases MLE parameters are quite different. This is especially true with more complex models so do not think you can always substitute different kinds of estimators.
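To give a feel for the idea, here is a deliberately simple sketch: a single binomial proportion (the share of right answers, 1367 of 1818) estimated by trying many candidate parameter values and keeping the one under which the observed data are most likely. This is an assumption-laden toy, not the iterative fitting routine that statistical packages use for loglinear models, and the grid of candidates is an arbitrary choice.

```python
from math import log

successes, trials = 1367, 1818  # right answers and total cases from the table


def log_likelihood(p):
    """Binomial log-likelihood of the observed counts for a candidate p
    (the constant binomial coefficient is omitted; it does not affect the maximum)."""
    return successes * log(p) + (trials - successes) * log(1 - p)


# Candidate parameter values 0.001, 0.002, ..., 0.999.
candidates = [i / 1000 for i in range(1, 1000)]

# The maximum likelihood estimate is the candidate with the highest likelihood.
p_hat = max(candidates, key=log_likelihood)

# For this simple model a direct, one-step answer also exists.
closed_form = successes / trials

print(p_hat)
print(round(closed_form, 4))
```

In this toy case the search and the direct calculation agree to within the grid spacing. Real loglinear software refines its estimates over successive iterations rather than scanning a fixed grid.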
 
 
 
Susan Carol Losh
January 13 2009
January 9 2009