FALL 2013



At last! We now focus on different ways of conducting studies and gathering data. Each data collection technique has its own set of strengths and weaknesses. That is why it is advisable over the long run for a researcher to conduct a series of studies, all with the same independent and dependent variable(s) but using a mix of experiments, ethnographies, surveys, content analysis, focus groups, and so forth. (Check for these follow-ups in the materials you read that are germane to your field.)

As we saw in Guide 3, a major strength of true experiments is causal control and strong internal validity. Various threats to internal validity are described in more detail below. In an experiment you can literally build your own independent variables by:

(1) Creating "factors" or levels of some kind of treatment then
(2) Randomly assigning participants or groups to different levels of the treatment.

It is RANDOMIZATION that is the major contributor to making an experiment a true experiment. Randomization controls for everything that you can think of as an alternative causal explanation--and everything that you cannot.
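If it helps to see the two steps above in concrete form, here is a minimal sketch in Python. The participant IDs, the treatment level names, and the function name `randomly_assign` are all made up for illustration; the only point is that a shuffle, not human judgment, decides who gets which level.

```python
import random

def randomly_assign(participants, levels, seed=None):
    """Shuffle participants, then deal them out to treatment levels in turn."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {level: [] for level in levels}
    for i, person in enumerate(shuffled):
        groups[levels[i % len(levels)]].append(person)
    return groups

# Hypothetical participant IDs and treatment levels (assumptions for illustration).
participants = [f"P{i:02d}" for i in range(1, 31)]
levels = ["method A", "method B", "control"]

groups = randomly_assign(participants, levels, seed=2013)
for level, members in groups.items():
    print(level, len(members))
```

Dealing the shuffled list out in turn keeps the group sizes equal, which is a convenience rather than a requirement of random assignment.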

However, a true experiment is simply not always possible, yet investigators still want to make causal statements. Moreover, even if you have conducted a true experiment, not all experiments have equally strong causal control. Issues with reactivity, poor measurement, and the nature of control groups can all influence the degree of internal validity in an experimental design.



What makes a true experiment is random assignment of people or groups to treatments. Human judgment plays no role in who gets which experimental condition. The strength of randomization is that it creates two or more groups that are approximately equivalent in the very beginning on the average on just about any characteristic you can imagine.

Of course, we are speaking of the long run and of reasonably sized samples. If you have two groups of five people each, I wouldn't count on them necessarily being very similar. However, even with as few as 10 people per group you will begin to see the beauty of randomization as a research design.
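You can watch this happen in a small simulation. The sketch below assumes a made-up pre-existing trait (mean 100, standard deviation 15), splits people into two groups at random many times, and reports how far apart the two group means are on average. The function name and all of the numbers are assumptions for illustration, not results from any study.

```python
import random
import statistics

def average_group_gap(n_per_group, trials=2000, seed=0):
    """Average absolute difference in group means on a pre-existing trait
    when 2 * n_per_group people are split into two groups at random."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # Hypothetical trait: mean 100, SD 15 (an assumption for illustration).
        people = [rng.gauss(100, 15) for _ in range(2 * n_per_group)]
        rng.shuffle(people)
        group_a = people[:n_per_group]
        group_b = people[n_per_group:]
        gaps.append(abs(statistics.mean(group_a) - statistics.mean(group_b)))
    return statistics.mean(gaps)

gap_by_size = {n: average_group_gap(n) for n in (5, 10, 50, 100)}
for n, gap in gap_by_size.items():
    print(f"{n:>3} per group: average gap in group means = {gap:.1f}")
```

The gap between randomly formed groups should shrink steadily as group size grows, which is the sense in which randomization makes groups "approximately equivalent."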

But randomization just isn't always possible. Some treatment groups are initially formed on the basis of performance (high, medium, low, for example), and some variables (e.g., bipolar depressive disorder) just can't be experimentally induced.

If your study has different levels of treatments, and people or groups are assigned to those treatments WITHOUT random assignment, you have a quasi-experiment.

It's not just having intact groups that creates a quasi-experiment. Individuals who are not in intact groups could enter treatment levels through self-selection, because they are in a particular performance category (that bottom quartile in performance, for example), or because a researcher has "paired" individuals that she or he believes are somehow similar.

However, in cases such as these, self-selection or regression toward the mean effects, rather than the treatment, are alternative explanations for the results you found.

The difficulty, in quasi-experiments, is trying to find out just how similar the groups were at the very beginning, before any treatment at all began. Sometimes, in fact, if groups are created on the basis of dissimilarities, such as ability, we know the groups are different at the very beginning. If we have basic or prior information about those who comprise the individuals or groups in the different treatments we may, at least, try to institute statistical controls for those variables.

For example, often a school will introduce an experimental new technique, as yet not well evaluated. Scores on some student measure are taken at the beginning and the end of the study period. Was there any kind of comparison group? Was it a true control group? What did the comparison or control group do instead of the experimental treatment?

How might you find out about just how similar or different groups were at the beginning of a study?

Background information. You might have access to grades, test scores, "personality" or other standardized test results collected before the study ever began.

Some kind of pretest measures. These vary from requesting background or "demographic" information such as own or parental education, occupation, or income, to various standardized tests. Be careful, however! Remember that pretests can sensitize people that their behavior is under study and lead to pretest-treatment interaction biases.

Supplemental information from other people. Interviews with teachers, parents, physicians, therapists, or others who know the subjects of study well may provide supplemental information.

Here's the basic problem: even if we assign groups to treatments based on their differences, such as a high ability and low ability group, the groups may differ in other respects, on variables that we never measured at all. For example, the high ability group may be more motivated or more confident, on the average, than the low ability group. And it is perhaps those differences in motivation or confidence, instead of the differences in ability (that YOU thought was the true independent variable), that were the true causes of the treatment outcome differences that you observed.

THREAT!  Even if you are able to obtain background, pretest, or supplemental information, you may have never measured the true differences between your groups on other variables. And those true differences that you never measured could be the real causes responsible for the outcome effects that you found in your study.

Now you can begin to see why quasi-experimental designs pose threats to internal validity.

Be patient for a little bit! We will return to the issue of intact groups shortly. Meanwhile, remember that if people were initially assigned to intact groups in a random fashion, you may have a true experiment after all. For a review, see Guide 3.


Many of the types of quasi-experimental designs are very similar to true experimental designs except that randomization never takes place.

Just as we have in experiments, one group may be assigned a treatment. Then, following the treatment, we measure some type of observation or dependent variable for both the group that received a treatment and the group that did not. Here is one comparison of a quasi-experimental design with the corresponding "true" experimental design:

Where "X" is a particular treatment or intervention and "O" is a measured outcome, and "R" indicates whether participants were randomly assigned to treatment groups.

Why is the control group "nonequivalent" for the quasi-experimental design? Because we did not use random assignment to place subjects in treatment groups, we cannot assume that on the average the groups are the same, or equivalent, to begin with.

However, two types of design used more often in quasi-experimental situations are the time series design (sometimes called a "natural experiment") and the case study.

In the time series design, you have several observations over time. While you may have some type of experimental intervention, often "nature" does the experimenting for you:

In all these cases, we assume that you had a series of "pre-intervention" measures, or that you are able to obtain a series of pre-intervention measures, which you then continue following the intervention. In all likelihood, you don't have a control group (let alone a randomized control group) so you can't tease out specifically what it was about the intervention, legislation, or pharmaceutical that caused the outcomes you observed. If you have enough advance warning, you may be able to have more groups, although without randomization of treatments to groups, you still have many of the threats to internal validity listed below.

Case studies occur with some frequency in medical, educational, and therapeutic fields. Practitioners who work one on one (such as counselors) or with very small groups (special education classes) are the most likely to use case studies. Subjects are not random, the case base is small, and there may be no control group. As you can guess, causal inference is much more difficult. For example, some people believe that they were abducted by Unidentified Flying Objects (UFOs) then returned to Earth. Much of the research on such individuals is conducted by clinicians. Their patients comprise the sample (sometimes a sample of one), and the research uses a case study approach wherein inferences are made about the person's proneness to fantasy construction. However, due to the lack of comparison groups, it can be difficult to ascertain what the true causal variables are.

What's the best you can do under such circumstances? Impose a time series of observations if possible. If the intervention is under your control (dispensing a new medication, for example), impose the intervention, remove it, impose it again, remove it, and so forth. Try to use a double blind (see below) administration if possible and the most objective outcome measures that you can find.
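Here is one rough way to look at data from an impose/remove/impose/remove sequence. The daily symptom counts, the phase length, and the helper name `phase_labels` are all hypothetical; this is only a sketch of how you might line up observations with "on" and "off" phases and compare their averages.

```python
import statistics

def phase_labels(n_phases, length):
    """Label each observation 'off' or 'on', alternating, starting with an 'off' baseline."""
    labels = []
    for phase in range(n_phases):
        labels.extend(["on" if phase % 2 else "off"] * length)
    return labels

# Hypothetical daily symptom counts: baseline, treatment, withdrawal, treatment again.
observations = [9, 8, 9, 10,  5, 4, 5, 4,  8, 9, 9, 8,  4, 5, 3, 4]
labels = phase_labels(n_phases=4, length=4)

means = {}
for state in ("off", "on"):
    scores = [obs for obs, label in zip(observations, labels) if label == state]
    means[state] = statistics.mean(scores)

print(means)
```

If the outcome tracks the intervention each time it is imposed and removed, a one-time historical event becomes a less plausible explanation, although without a comparison group the inference is still weaker than in a true experiment.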


Many research methods textbooks virtually define quasi-experiments as those using intact groups, i.e., groups that existed prior to any treatment or intervention. Normally (say, 75 to 80 percent of the time) this is true. What are important are:

(1) HOW subjects entered the groups in the first place;
(2) What happens in the group; and
(3) The length of time groups pre-existed prior to interventions.

If subjects are randomly assigned to groups in the first place (which often happens in schools and universities for classes where there are many equivalent sections), AND the tasks to be performed prior to the intervention are virtually identical in each group, AND the pre-intervention time is short (probably a few weeks at most), THEN if you randomly assign groups to conditions, you probably have a true experiment.

Consider some of the alternatives. Participants may be assigned to groups using pre-existing knowledge about them, and the groups consequently differ on variables related to the study. Sports teams grouped by ability, "tracking" systems in schools, and enlisted versus officers in the military are three examples.

Even if random assignment originally places participants in groups, their curricula and itineraries may be different, thus providing subjects in different groups with different experiences.

Finally, bosses and teachers differ in their approaches, again providing subjects in different groups with different experiences which diverge further as time goes on.

So, if you use, for example, randomly selected sections of basic college math, randomly assign them to treatments, AND do at least much of your data collection at the very beginning of the academic year, you probably have a true experimental design. Does your study meet all the criteria in this example? If not, your design is probably quasi-experimental.


Threats to internal validity are essentially threats to causal control. They mean that we do not know for sure what caused the effects that we observed. Naturally, we like to hope that our interventions (experimental treatments) or other known and measured independent variables caused the effects. Unfortunately this is often not the case. For example, because of their multidimensionality, confounded variables (which measure more than one entity) are a threat to internal validity.


If you have tight control over your experimental treatments (and, of course, you used randomization), hopefully the only source of variance left in your dependent variables will be random error.

Random error is just that: It is the random variation that occurs on measurements across administrations, situations, or time periods. If random error is VERY large, it can pose a threat to the reliability (predictability, stability) of our measurements. Many political attitudes, for example, are highly unstable or volatile.

On the other hand, because it is random, random error does not usually pose a threat to internal validity.

Bias is systematic error, such as the scale that always weighs you in at five pounds too light. Bias introduces a constant source of error into measurements or results. Bias can occur when test items that favor a particular ethnic, age, or gender group are used. For example, a "culture exam" that asked respondents to identify songs from the 1950s and the 1960s would discriminate against much younger people. Tests of "science knowledge" often favor younger people because they use the most recent definitions of science phenomena and thus favor those with a more recent education. Bias in testing instruments is a threat to internal validity because it poses an alternative explanation for the results that we found.
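The difference between random error and bias is easy to see in a quick simulation of the bathroom scale example. The true weight, the size of the random error, and the number of weighings below are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)

TRUE_WEIGHT = 150.0  # hypothetical true weight in pounds
BIAS = -5.0          # the scale that always reads five pounds too light
NOISE_SD = 2.0       # assumed size of the random error on any one weighing

# A scale with random error only, and a scale with the same random error plus bias.
noisy_only = [TRUE_WEIGHT + rng.gauss(0, NOISE_SD) for _ in range(1000)]
noisy_biased = [TRUE_WEIGHT + BIAS + rng.gauss(0, NOISE_SD) for _ in range(1000)]

print(f"random error only:   mean = {statistics.mean(noisy_only):.1f}")
print(f"random error + bias: mean = {statistics.mean(noisy_biased):.1f}")
```

Averaging many weighings washes out the random error, so the first mean should land close to the true weight. No amount of averaging removes the constant five pounds, which is why bias, unlike random error, offers an alternative explanation for results.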

Most of us have scales that weigh us as lighter than those at the doctor's office. Hmmm.

If we could either control bias experimentally (random assignment controls much of it by making experimental treatment groups roughly equivalent at the beginning of a study, thus controlling factors such as self-selection or regression toward the mean effects) or measure the variables we suspect cause bias and thus control them statistically, we would at least maximize internal validity.

Unfortunately bias is often hidden, either in the variables you didn't measure--or the variables you didn't consider at all. Thus you didn't measure it and only discover your mistake after all your data are collected. Confounded variables are a major threat to internal validity.


Self-selection effects: When participants can select their own treatments (e.g., students who decide whether or not to respond to an online survey), we do not know whether the intervention or a pre-existing factor of the participant caused the outcomes we observed. Random assignment can cure this problem. The same problem can occur with differential selection, only in this case, the investigator (rather than the participant) uses human judgement to assign groups or subjects to treatment. A common variation on this one is selecting extreme groups (see below).

Experimental mortality: When participants discontinue their participation in a study and this occurs more in certain conditions than others, we do not know how to causally interpret the results because we don't know how people who discontinued participation differed from those who completed it. A pretest questionnaire given to all subjects may help clarify this, but watch out for pretesting effects (a Solomon four group design can help here; see Guide 3).
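A simple first check for experimental mortality is to compare dropout rates across conditions. The counts and the function name `dropout_rates` below are hypothetical, purely for illustration.

```python
def dropout_rates(enrolled, completed):
    """Proportion of enrolled participants in each condition who did not finish."""
    return {
        condition: 1 - completed[condition] / enrolled[condition]
        for condition in enrolled
    }

# Hypothetical counts, for illustration only.
enrolled = {"treatment": 50, "control": 50}
completed = {"treatment": 35, "control": 47}

rates = dropout_rates(enrolled, completed)
for condition, rate in rates.items():
    print(f"{condition}: {rate:.0%} dropped out")
```

A large gap between conditions is a warning sign, not a diagnosis: it tells you that the people who remained may no longer be comparable, but not how they differ.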

History: Some kind of event occurred during the study period (such as the assaults on New York City) and it is reactions to these events that caused the outcomes we observed. Sometimes this is a medical event (such as a flu outbreak) and sometimes an actual political or historical event. Random assignment and a control group help with this problem.

Maturation effects are especially important with children and youth (such as college freshmen) but could happen at any age. For example, young children's speech will normally become more complex, no matter what reading method you use. Some studies have found that most college students pull out of a depression within six months, even if they receive no treatment whatsoever. A certain number of people will stop smoking, whether they receive treatment or not. Again, a randomized control group helps.

Regression toward the mean effects ("statistical regression") are especially likely when you study extreme groups. For example, students scoring at the bottom of a test typically improve their scores at least a little when they retake the test. Students with nearly perfect scores might miss an item the second time around. That is, people with extreme scores, or in extreme groups, will often fall back toward the average or "regress to the mean" on a second administration of the dependent variable.

Regression toward the mean effects are especially likely to occur among well-meaning investigators, who want to give a treatment that they believe is very beneficial to the group that appears to need it the most (the top scoring group is usually left alone.) When the scores of the worst group improve after the intervention (and the top group scores a little lower on the readministration), misguided investigators are even more convinced that they have found a good treatment (instead of a methodological artifact.) How to avoid this threat to internal validity? Either avoid extreme groups, or if you do use them, randomly assign their members to treatment conditions, INCLUDING A CONTROL GROUP. Thus, among the lowest scoring students, one third would receive intervention #1, one third would receive intervention #2, and one third would receive no intervention at all.
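To convince yourself that this is a methodological artifact, try a simulation in which NO ONE receives any treatment at all. The sketch below assumes each hypothetical student has a stable "true" level and that each test adds fresh random error; the means and standard deviations are made-up values for illustration.

```python
import random
import statistics

rng = random.Random(7)
N = 3000

# Each hypothetical student has a stable "true" level; each test adds fresh random error.
true_level = [rng.gauss(70, 8) for _ in range(N)]
test1 = [t + rng.gauss(0, 6) for t in true_level]
test2 = [t + rng.gauss(0, 6) for t in true_level]

# Pick the extreme groups using ONLY the first test, as a well-meaning investigator would.
order = sorted(range(N), key=lambda i: test1[i])
bottom = order[: N // 10]
top = order[-(N // 10):]

def mean_of(scores, indices):
    """Mean score for the students at the given positions."""
    return statistics.mean([scores[i] for i in indices])

bottom_change = mean_of(test2, bottom) - mean_of(test1, bottom)
top_change = mean_of(test2, top) - mean_of(test1, top)

print(f"bottom 10% on test 1: average change on retest = {bottom_change:+.1f}")
print(f"top 10% on test 1:    average change on retest = {top_change:+.1f}")
```

The bottom group should "improve" and the top group should "decline" with no intervention whatsoever. That is exactly why an untreated control group drawn at random from the same extreme group is needed: it shows how much change you would have seen anyway.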

Testing. Just taking a pretest can sensitize people, and many people improve their performance with practice. Almost every classroom teacher knows that part of a student's performance on assessment tests depends on their familiarity with the format. Solution? A Solomon Four Group Design, wherein half the subjects do not receive a pretest, is a good way to control inferences in this case.
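As a rough sketch of how the four cells of such a design could be filled, the code below randomly deals participants into every combination of pretest versus no pretest and treatment versus control. The participant numbers, cell labels, and function name are assumptions for illustration.

```python
import random
from itertools import product

def solomon_four_group(participants, seed=None):
    """Randomly deal participants into the four pretest-by-treatment cells."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    cells = list(product(("pretest", "no pretest"), ("treatment", "control")))
    design = {cell: [] for cell in cells}
    for i, person in enumerate(shuffled):
        design[cells[i % len(cells)]].append(person)
    return design

# Hypothetical participant numbers 1 through 40.
design = solomon_four_group(range(1, 41), seed=3)
for (pretest, condition), members in design.items():
    print(f"{pretest:>10} / {condition:<9}: {len(members)} participants")
```

Comparing the pretested and unpretested cells that received the same condition is what lets you see whether the pretest itself changed the outcome.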


While a true experiment can be higher on internal validity, by no means do all experiments have high internal validity. To enhance internal validity, the investigator must use control groups effectively, control reactivity, and scrutinize experimental reality. Further, you need to know if people noticed and comprehended your treatment or intervention in the first place.


When a new pharmaceutical drug is tested, typically all experimental subjects receive a pill.

Some receive the new active ingredient, such as a brand new antihistamine or antibiotic.
Some receive an older medicine, such as Tavist (clemastine) or penicillin.
Yet others receive an inert "sugar pill" that has no active ingredients, or a placebo.

These control or comparison groups are an absolute necessity in any design, but certainly for an experiment.

The group receiving the older medication lets us know if the new drug (or intervention) is less effective, as effective, or more effective than treatments currently available.

The group receiving the "sugar pill" alerts us to changes that occur with the two active ingredient medication groups above and beyond a placebo effect. In a placebo effect, changes that occur are due to other factors besides the active treatment. For example, a patient might feel "safe" and "treated" if their doctor gives them a pill, even a sugar pill. Because of these psychological changes, their immune system might actually function better. This is very interesting but not what you set out to assess. So, anything that smacks of a placebo effect is a threat to internal validity and must be controlled for.

Notice that the "control group" GETS A PILL. The "nothing at all" control group is generally a very poor design. For example, if you were studying the effects of watching a violent film on aggression imitation among school children, the very act of watching any movie can be physiologically arousing. If your control group watched no movie at all, then you could not control for these effects. So, instead, your control group watches a generally unaggressive movie such as "The Adventures of Milo and Otis" (a young dog and young cat who are friends).

THE MORAL: Design your control group carefully. See that your control group has some features in common with your treatment groups if those features could affect your outcomes (it takes a pill, sees a film, or fills out a pretest questionnaire.)


Reactivity refers to changes in the study participants' behavior simply because they are being studied.

For example, some people get nervous when a doctor or nurse takes their blood pressure, and their blood pressure goes up.

Reactivity poses a distinct threat to internal validity because we don't know what caused the outcome: treatment effects or reactivity. The experimental laboratory is probably the most reactive because people have come for an experiment and they know their behavior is under scrutiny. That is why so many experimenters use deception. They are trying to divert subject attention so that the "true behavior under study" is not altered.

Demand effects occur when subjects or respondents "follow orders" or cooperate in ways that they almost never would in their routine daily lives.

Social Desirability effects take several forms. Most people and groups (who allow you to study them at all) try to cooperate with researchers. But some try to discover the purpose of the intervention and thwart it, or "wreck the study." Social Reactance effects refer to boomerang effects in which individuals or groups "fake bad," or deliberately deviate from study procedures. This happens more among college students, and others who suspect that their autonomy is being threatened.

ON REACTIVITY AND EXTERNAL VALIDITY. If demand effects are specific to a particular situation, reactivity problems may also influence generalizing, or external validity (this is how your Wiersma book treats the term.)

However, more often, I think reactivity introduces an alternative causal explanation for our results: the results occurred, not because of the intervention or treatment, but because people were so self-conscious that they changed their behavior. This is internal validity. Reactivity may also statistically interact with the experimental manipulation. For example, if the treatment somehow impacts on self-esteem (say you are told that the stories you tell to the TAT pictures indicate your leadership ability), reactivity may be a greater internal validity problem.


More of a threat to external validity is the issue of the reality of the study setting. In many cases, such as studies of classrooms or online environments, the setting of the study is identical to the "everyday reality" or mundane reality in which most subjects live their lives. High mundane reality makes it easier to generalize to people's typical settings and it facilitates external validity. Field studies of all kinds, and ethnographies, too, take place in typical, as opposed to unusual, settings.

However, laboratory experiments in particular may use unusual settings or tasks. For example, some sports experiments will have participants on a treadmill for hours. In other studies, subjects may be injected with substances (such as adrenaline) or take pills. Subjects may see specially constructed movies that are nothing like what they see on TV. Or they may be called upon to perform tasks (watching a light "move" in a darkened room) that bear no resemblance to their normal environment. While these settings or tasks may be engrossing or compelling, thus high in experimental reality, they do not resemble the settings to which researchers may really want to generalize.

I HOPE YOU USED A Manipulation Check.

YOU are certain that your intervention will make life healthier or enhance learning. But what if no one pays attention to the treatment or comprehends its message? Then it will appear that you have no effects at all, whereas if you had simply used a stronger manipulation, your guesswork might have been confirmed.

Anyone doing experimental work needs to have a manipulation check, an inclusion to measure if participants even paid attention to factors in the treatment and understood their messages. For example, if you show different movies to different groups and your topic is filmed aggression, have a short questionnaire that has participants rate the violence of the movie. The group receiving the more aggressive film should rate it as more violent than those receiving an unaggressive movie. If you are trying a new reading technique, make sure that students understand the stories they are exposed to and remember something about them. If you try a new template in your online learning course, did students even pay attention?
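For the film example, a manipulation check can be as simple as comparing the two groups' violence ratings. The ratings below are invented 1 to 7 scores for illustration, and the Welch t statistic is computed by hand from means and variances as one reasonable way (not the only way) to gauge whether the groups differ.

```python
import statistics

# Hypothetical 1-7 ratings of "How violent was the movie?", for illustration only.
ratings = {
    "aggressive film": [6, 7, 5, 6, 6, 7, 5, 6],
    "unaggressive film": [2, 1, 2, 3, 1, 2, 2, 1],
}

means = {group: statistics.mean(scores) for group, scores in ratings.items()}
gap = means["aggressive film"] - means["unaggressive film"]

# Welch's t statistic, computed by hand from the group means, variances, and sizes.
a = ratings["aggressive film"]
b = ratings["unaggressive film"]
standard_error = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
welch_t = gap / standard_error

print(f"mean ratings: {means}")
print(f"difference in means = {gap:.2f}, Welch t = {welch_t:.1f}")
```

If the group shown the aggressive film does not rate it as clearly more violent, the manipulation may have been too weak, and a null result on your outcome measure tells you very little.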


When the medical and pharmacy professions test a new medicine, they don't just use a "sugar pill" placebo.

Participants in the study do not know if they are taking a new medication, an old medication, or a sugar pill.

The individuals who pass out the medication and assess the subjects' health and behavior also do not know whether the person is taking a new medication, an old medication, or a sugar pill.

Thus both those involved as subjects and those involved with collecting data are "blind:" blind to the purposes of the study, the condition that subjects are in, and the results expected.

This means that

Almost no one who collects data "likes deception" but without at least a little bit of it, you may introduce reactivity and bias into your study. Do the minimum (I prefer "omission" rather than deliberate lies) and be sure to debrief participants after their participation in the study is completed.

Debriefing means that you tell participants the true purpose of the study and any manipulations pertinent to their role in it. Debriefing is ethically mandatory, and is especially important if your manipulation involved lies about the student's performance ("no, you really didn't score in the 5th percentile on that test, all feedback was bogus") or any other aspect of the "real world."

Did you make sure your participants knew enough about the study procedures IN ADVANCE to give informed consent?

Have you made it clear to participants that nothing bad will happen to them if they REFUSE to participate in the study (they won't get a bad grade, get kicked out of the Boys and Girls Club, or lose their job)?

Do you have a plan to keep the responses of participants confidential to the degree allowed by law? This means locking up surveys or experimental protocols in a cabinet or some other type of very limited access storage.

Have you provided to the participants a telephone number and email address for you and your major professor, so that they can contact either of you if they have any questions?

For more on what we as researchers owe the participants in our studies, see the Human Subjects Committee information (the "IRB" or "Institutional Review Board").

Susan Carol Losh
Revised September 16 2013