There are dozens of possible designs for determining program impact; this section summarizes the most popular ones. Because designs without comparison groups are very deficient in terms of internal validity, they are discussed first, mainly to show why comparison groups are needed.
**Single Group Designs**

The simplest but least satisfactory evaluation design is the posttest-only design, symbolized X
O (where X is the program intervention and O a post-program outcome observation). Whenever a program is supposed to bring about a change, before-and-after measures are a necessity with one-group designs.

The simplest of these, the pretest-posttest design, symbolized O1 X O2 (where X again denotes the intervention and O1 and O2 denote pre- and post-program outcome measures), requires a pretest of some sort before the program takes place (a reading test, for example) and a posttest after the program. This design is subject to most threats to internal validity. Most seriously, participants might have changed, or some extraneous event may have brought about any observed difference between O1 and O2, so no change can credibly be ascribed to the program. For example, if an evaluation of this design showed that mean post-program earnings were lower than mean pre-program earnings, this should not be construed as proof that the program was defective: a recession may have caused the earnings drop.

Time series designs involve collecting data on participants' situation repeatedly, at several points in time. Symbolically:

O1 O2 O3 X O4 . . . On

This design can be used to rule out (or at least quantify) regression and maturation as threats to
internal validity. That is, any personal trends in the absence of the program can be accounted for
and controlled. Advanced statistical procedures are required to isolate the effect of the program.
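Those statistical procedures can be as simple as a segmented (interrupted) time-series regression, which separates the pre-existing trend from a level shift at the intervention. A minimal sketch, using simulated data with invented numbers purely for illustration:

```python
# Interrupted time-series sketch: fit an intercept, a trend, and a
# post-intervention level shift.  All data here are simulated.
import numpy as np

rng = np.random.default_rng(0)

t = np.arange(12)                    # O1 ... O12, intervention after period 6
post = (t >= 6).astype(float)        # 1 after the intervention, 0 before

# Simulated outcome: upward trend of 1.0 per period (maturation), plus a
# +5.0 jump attributable to the program, plus noise.
y = 2.0 + 1.0 * t + 5.0 * post + rng.normal(0, 0.3, size=t.size)

# Design matrix: [intercept, trend, level shift]
X = np.column_stack([np.ones_like(t, dtype=float), t, post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"estimated trend: {beta[1]:.2f}")           # pre-existing maturation
print(f"estimated program effect: {beta[2]:.2f}")  # level shift at X
```

Because the trend is estimated from the pre-program observations, maturation is controlled; an external shock coinciding with X (history) would still be confounded with the level shift.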
History remains the key threat. Although supplementary data on the environment can help rule out events that can be identified, it is extremely difficult to identify every relevant event.

In sum, one-group designs virtually preclude any serious summative evaluation. They are notoriously weak and easily dismissed, because it is generally impossible to rule out potential alternative explanations, especially key events that occurred while the treatment group was participating in the program (e.g., a recession). Two-group designs are required for sound evaluation.

**Two Group Designs**

A valid determination of impact requires comparing the outcomes of a group of individuals who have participated in the program (the treatment group) with those of an equivalent group who have not (the control or comparison group). In theory, the best way to do this is by means of a randomized experiment, in which individuals are assigned at random to the treatment or control group (Rossi and Freeman, 1993). Outcome measures, chosen on the basis of program objectives, are observed at some interval after the intervention ends, with any differences between groups attributable to the program: that is, the program can be said to have caused the observed differences. The design is symbolized as follows:
R O1 X O2 [treatment group]
R O3 O4 [control group]

(where R denotes random assignment). Since randomization should remove any systematic pre-existing differences between the two groups, selection bias is not a concern; the primary threat to this design is attrition, since participants who drop out before outcomes are measured can make the groups non-equivalent again. Although experimental designs are as close to ideal as possible in theory, they are seldom practical. By far the most common constraints are program staff who refuse to comply because they consider randomized selection unethical or unacceptable, and evaluation timing: most often the evaluator enters the scene long after random assignment should have taken place. There are other problems as well. Most seriously, experimental methods are normally confined to a determination of the mean impact of the program; they cannot answer many other key policy questions, such as the median impact of the program or the proportion of participants with a positive (or negative) impact (Heckman and Smith, 1995)2.

Given these constraints, quasi-experimental (non-experimental) designs are frequently the only satisfactory way to proceed. There are different quasi-experimental models, but the most common and robust method involves constructing a comparison group of individuals who are comparable to participants. This can be done by statistically controlling for differences between groups during data analysis; by matching participants and non-participants on key traits (such as age, sex and education) believed to influence the outcomes of interest; or both3. The idea is to approximate random assignment as closely as possible by minimizing or controlling for differences between the groups. Symbolically4:

O1 X O2 [participants]
O3 O4 [non-participants]

Here X is the program intervention, O1 is a pre-program observation, and O2 is a post-program observation.

Under a longitudinal quasi-experimental approach, for example, the evaluator compares the
outcomes of two groups: program participants (the "treatment group") and non-participants (the
"comparison group")5. "Outcomes" relate to the objectives of the training program - finding a job, for example - and are measured for both groups.

For example, say a baseline survey (see below) showed that half the participants and half the non-participants were working one year prior to applying for EI, and a follow-up survey found that 70% of trainees were working one year after the training, versus 60% of the comparison group one year after leaving EI. The increment for trainees is thus 20 percentage points, versus 10 for non-trainees. Simple statistical tests would determine whether this difference is significant.

But a finding of a statistically significant difference between groups does not necessarily imply that the difference was due to the program. The analyst must demonstrate that the difference is attributable to the program; that is, threats to internal validity must be ruled out. Unfortunately, the empirical evidence shows that participants are likely to differ from non-participants in ways that affect the outcome variables. Selection into most programs is non-random: those who volunteer to participate may be more motivated than those who do not, for example, and program administrators more often than not select those they feel have the best chance of succeeding (i.e., the most talented) or, conversely, those most in need of the treatment. Regardless of its source, selection bias affects the comparability of treatment and comparison groups.

As long as all differences between the groups being compared are observable (e.g., personal traits), selection bias is not a problem, because statistical methods such as multiple regression analysis can control for the differences. Researchers do their utmost to match individuals in the treatment and comparison samples so that observed characteristics are very similar, but they seldom know why a person is participating in a program.
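The arithmetic above is a difference-in-differences estimate, and the "simple statistical test" could be a two-proportion z-test on the follow-up employment rates. A sketch using the figures from the example; only the rates come from the text, while the sample sizes (400 per group) are hypothetical:

```python
# Difference-in-differences for the employment example, plus a
# two-proportion z-test.  Group sizes of 400 are an assumption.
from math import sqrt

n_t, n_c = 400, 400            # hypothetical sample sizes
base_t, base_c = 0.50, 0.50    # employed one year before applying for EI
post_t, post_c = 0.70, 0.60    # employed one year after

did = (post_t - base_t) - (post_c - base_c)
print(f"difference-in-differences estimate: {did:.2f}")   # 0.10

# Two-proportion z-test on the follow-up rates
p_pool = (post_t * n_t + post_c * n_c) / (n_t + n_c)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
z = (post_t - post_c) / se
print(f"z = {z:.2f}")          # |z| > 1.96 => significant at the 5% level
```

With these assumed sample sizes the 10-point net difference is statistically significant; as the text goes on to argue, significance alone does not establish that the program caused it.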
If any unknown (hence uncontrolled) feature of the person or program influenced the decision to participate, then the selection is non-random and differences between participants and non-participants may be incorrectly ascribed to the program. No statistical method is likely to completely resolve the selection bias problem. Since it is
impossible to anticipate all the factors that went into the decision to participate, the surveys and
protocols cannot be designed to gather all relevant information. Quasi-experiments require
analysis techniques that are much more complicated than those for true experiments, and high-level statistical expertise is therefore essential.
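The matching on observed traits described above can be sketched as nearest-neighbour matching: each participant is paired with the non-participant closest on, say, age and years of education. All records below are invented for illustration; a real evaluation would match on many more traits (or on a propensity score) and check the balance of the matched samples afterwards:

```python
# Nearest-neighbour matching sketch on two observed traits.
# Records are (id, age, years of education) -- all invented.
participants = [
    ("p1", 25, 12),
    ("p2", 40, 16),
]
nonparticipants = [
    ("n1", 24, 12),
    ("n2", 35, 10),
    ("n3", 41, 16),
]

def distance(a, b):
    """Euclidean distance on (age, education)."""
    return ((a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2) ** 0.5

# For each participant, pick the closest non-participant.
matches = {
    p[0]: min(nonparticipants, key=lambda n: distance(p, n))[0]
    for p in participants
}
print(matches)   # {'p1': 'n1', 'p2': 'n3'}
```

Matching of this kind balances only the observed traits; the unobserved reasons for participating, which drive the selection bias problem, remain uncontrolled.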