There are dozens of possible designs for determining program impact; this section summarizes the most popular ones. Because designs without comparison groups are very deficient in terms of internal validity, they are discussed first, mainly to show why comparison groups are needed.
**Single Group Designs**

The simplest but least satisfactory evaluation design is the posttest-only design, symbolized X
O (where X is the program intervention and O a post-program outcome observation). Whenever a program is supposed to bring about a change, before-and-after measures are a necessity with one-group designs.

The simplest of these, the pretest-posttest design, symbolized O1 X O2 (where X again denotes the intervention and O1 and O2 denote pre- and post-program outcome measures), requires a pretest of some sort before the program takes place (a reading test, for example) and a posttest after the program. This design is subject to most threats to internal validity. Most seriously, participants might have changed, or some extraneous event may have brought about any observed difference between O1 and O2, so no change can credibly be ascribed to the program. For example, if an evaluation of this design showed that mean post-program earnings were lower than mean pre-program earnings, this should not be construed as proof that the program was defective: a recession may have caused the earnings drop.

Time series designs involve collecting data on participants' situation repeatedly, at several points in time. Symbolically:

O1 O2 O3 X O4 . . . On

This design can be used to rule out (or at least quantify) regression and maturation as threats to
internal validity. That is, any personal trends in the absence of the program can be accounted for
and controlled. Advanced statistical procedures are required to isolate the effect of the program.
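Those statistical procedures can be as simple as a segmented (interrupted) time-series regression, which separates the pre-existing trend from a level shift at the intervention. A minimal sketch, using simulated data with invented numbers purely for illustration:

```python
# Interrupted time-series sketch: fit an intercept, a trend, and a
# post-intervention level shift.  All data here are simulated.
import numpy as np

rng = np.random.default_rng(0)

t = np.arange(12)                    # O1 ... O12, intervention after period 6
post = (t >= 6).astype(float)        # 1 after the intervention, 0 before

# Simulated outcome: upward trend of 1.0 per period (maturation), plus a
# +5.0 jump attributable to the program, plus noise.
y = 2.0 + 1.0 * t + 5.0 * post + rng.normal(0, 0.3, size=t.size)

# Design matrix: [intercept, trend, level shift]
X = np.column_stack([np.ones_like(t, dtype=float), t, post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"estimated trend: {beta[1]:.2f}")           # pre-existing maturation
print(f"estimated program effect: {beta[2]:.2f}")  # level shift at X
```

Because the trend is estimated from the pre-program observations, maturation is controlled; an external shock coinciding with X (history) would still be confounded with the level shift.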
History remains the key threat. Although supplementary data on the environment can help rule out events that can be identified, it is extremely difficult to identify every relevant event.

In sum, one-group designs virtually preclude any serious summative evaluation. They are notoriously weak and easily dismissed, because it is generally impossible to rule out potential alternative explanations, especially key events that occurred while the treatment group was participating in the program (e.g., a recession). Two-group designs are required for sound evaluation.

**Two Group Designs**

A valid determination of impact requires comparing the outcomes of a group of individuals who have participated in the program (the treatment group) with those of an equivalent group who have not (the control or comparison group). In theory, the best way to do this is by means of a randomized experiment, in which individuals are assigned at random to the treatment or control group (Rossi and Freeman, 1993). Outcome measures, chosen on the basis of program objectives, are observed at some interval after the intervention ends, with any differences between groups attributable to the program: that is, the program can be said to have caused the observed differences. The design is symbolized as follows:
R O1 X O2 [treatment group]
R O3 O4 [control group]

(where R denotes random assignment). Since randomization should remove any systematic pre-existing differences between the two groups, selection bias is not a concern; the primary threat to this design is attrition, since participants who drop out before outcomes are measured can make the groups non-equivalent again. Although experimental designs are as close to ideal as possible in theory, they are seldom practical. By far the most common constraints are program staff who refuse to comply because they consider randomized selection unethical or unacceptable, and evaluation timing: most often the evaluator enters the scene long after random assignment should have taken place. There are other problems as well. Most seriously, experimental methods are normally confined to a determination of the mean impact of the program; they cannot answer many other key policy questions, such as the median impact of the program or the proportion of participants with a positive (or negative) impact (Heckman and Smith, 1995)2.

Given these constraints, quasi-experimental (non-experimental) designs are frequently the only satisfactory way to proceed. There are different quasi-experimental models, but the most common and robust method involves constructing a comparison group of individuals who are comparable to participants. This can be done by statistically controlling for differences between groups during data analysis; by matching participants and non-participants on key traits (such as age, sex and education) believed to influence the outcomes of interest; or both3. The idea is to approximate random assignment as closely as possible by minimizing or controlling for differences between the groups. Symbolically4:

O1 X O2 [participants]
O3 O4 [non-participants]

Here X is the program intervention, O1 is a pre-program observation, and O2 is a post-program observation.

Under a longitudinal quasi-experimental approach, for example, the evaluator compares the
outcomes of two groups: program participants (the "treatment group") and non-participants (the
"comparison group")5. "Outcomes" relate to the objectives of the training program - finding a job, for example - and are measured for both groups.

For example, say a baseline survey (see below) showed that half the participants and half the non-participants were working one year prior to applying for EI, and a follow-up survey found that 70% of trainees were working one year after the training, versus 60% of the comparison group one year after leaving EI. The increment for trainees is thus 20 percentage points, versus 10 for non-trainees. Simple statistical tests would determine whether this difference is significant.

But a finding of a statistically significant difference between groups does not necessarily imply that the difference was due to the program. The analyst must demonstrate that the difference is attributable to the program; that is, threats to internal validity must be ruled out. Unfortunately, the empirical evidence shows that participants are likely to differ from non-participants in ways that affect the outcome variables. Selection into most programs is non-random: those who volunteer to participate may be more motivated than those who do not, for example, and program administrators more often than not select those they feel have the best chance of succeeding (i.e., the most talented) or, conversely, those most in need of the treatment. Regardless of its source, selection bias affects the comparability of treatment and comparison groups.

As long as all differences between the groups being compared are observable (e.g., personal traits), selection bias is not a problem, because statistical methods such as multiple regression analysis can control for the differences. Researchers do their utmost to match individuals in the treatment and comparison samples so that observed characteristics are very similar, but they seldom know why a person is participating in a program.
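The arithmetic above is a difference-in-differences estimate, and the "simple statistical test" could be a two-proportion z-test on the follow-up employment rates. A sketch using the figures from the example; only the rates come from the text, while the sample sizes (400 per group) are hypothetical:

```python
# Difference-in-differences for the employment example, plus a
# two-proportion z-test.  Group sizes of 400 are an assumption.
from math import sqrt

n_t, n_c = 400, 400            # hypothetical sample sizes
base_t, base_c = 0.50, 0.50    # employed one year before applying for EI
post_t, post_c = 0.70, 0.60    # employed one year after

did = (post_t - base_t) - (post_c - base_c)
print(f"difference-in-differences estimate: {did:.2f}")   # 0.10

# Two-proportion z-test on the follow-up rates
p_pool = (post_t * n_t + post_c * n_c) / (n_t + n_c)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
z = (post_t - post_c) / se
print(f"z = {z:.2f}")          # |z| > 1.96 => significant at the 5% level
```

With these assumed sample sizes the 10-point net difference is statistically significant; as the text goes on to argue, significance alone does not establish that the program caused it.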
If any unknown (hence uncontrolled) feature of the person or program influenced the decision to participate, then the selection is non-random and differences between participants and non-participants may be incorrectly ascribed to the program. No statistical method is likely to completely resolve the selection bias problem. Since it is
impossible to anticipate all the factors that went into the decision to participate, the surveys and
protocols cannot be designed to gather all relevant information. Quasi-experiments require
analysis techniques that are much more complicated than those for true experiments, and high-level statistical expertise is therefore essential.
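The matching on observed traits described above can be sketched as nearest-neighbour matching: each participant is paired with the non-participant closest on, say, age and years of education. All records below are invented for illustration; a real evaluation would match on many more traits (or on a propensity score) and check the balance of the matched samples afterwards:

```python
# Nearest-neighbour matching sketch on two observed traits.
# Records are (id, age, years of education) -- all invented.
participants = [
    ("p1", 25, 12),
    ("p2", 40, 16),
]
nonparticipants = [
    ("n1", 24, 12),
    ("n2", 35, 10),
    ("n3", 41, 16),
]

def distance(a, b):
    """Euclidean distance on (age, education)."""
    return ((a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2) ** 0.5

# For each participant, pick the closest non-participant.
matches = {
    p[0]: min(nonparticipants, key=lambda n: distance(p, n))[0]
    for p in participants
}
print(matches)   # {'p1': 'n1', 'p2': 'n3'}
```

Matching of this kind balances only the observed traits; the unobserved reasons for participating, which drive the selection bias problem, remain uncontrolled.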