Appendix A: True Experiments Versus Quasi-Experiments


There has been a lively debate for over a decade on whether true experiments or quasi-experiments represent the superior design. This appendix will present each side's argument.

The Experimental Camp

Those favouring the experimental approach question the validity of even highly sophisticated non-experimental approaches to assessing the impact of training and employment programs, pointing to the apparent difficulty of obtaining reliable impact estimates for labour market programs. They claim that economists using quasi-experimental techniques have had little success in isolating program effects (i.e., removing the "selection bias").

Scholars such as Ashenfelter and Card (1985), Barnow (1987), and LaLonde and Maynard (1987) contend that results of dozens of econometric studies were so varied that there may be no sound way to adequately measure program effects short of an experimental evaluation with random assignment to training or control groups.

The findings of LaLonde (1986) and LaLonde and Maynard (1987) were exceptionally damaging to the case for non-experimental designs. To assess the accuracy of such designs, they compared the results of a true experiment, the National Supported Work (NSW) Demonstration Project, with those derived from various widely used non-experimental procedures, to see whether the latter could accurately estimate the true program impacts. They concluded that "there does not appear to be any formula [using non-experimental methods] that researchers can confidently use to replicate experimental results of the Supported Work Program." Using the same evidence, Fraker and Maynard (1987) concluded:

This analysis demonstrated that results may be severely biased depending on the target population, the comparison group selected, and/or the analytic model used. More importantly, there is at present no way to determine a priori whether comparison group results will yield valid indicators of the program impacts (p. 216).

More recent work confirms the chief conclusions drawn from these studies: that quasi-experimental estimators are biased and sensitive to minor changes in model specification. Friedlander and Robins (1995) assessed two conventional quasi-experimental strategies against experimental data from four social assistance reform experiments: comparing the treatment group in one locale to a comparison group in another locale; and comparing outcomes of the treatment group with a pre-program comparison group in the same area. They concluded that the non-experimental estimates were "usually quite different from the experimental estimates," especially for the comparison samples drawn from different areas. They also studied two statistical techniques for improving the accuracy of quasi-experimental estimates: statistical matching to produce closely matched comparison groups; and "specification tests," which statistically assess the econometric model employed to determine whether the estimates it yields are accurate (see footnote 7). Neither strategy markedly improved the accuracy of the non-experimental estimates.
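To make the idea of a specification test concrete, the following is a minimal sketch of the kind of pre-program check described in footnote 7: a simple regression of pre-program earnings on a trainee indicator; if it already finds a statistically significant difference between trainees and the comparison group before the program began, the comparison group (or model) is rejected. The column names and the 5 percent threshold are assumptions made for illustration, not part of Friedlander and Robins' procedure.

    # Minimal sketch of a pre-program specification test (see footnote 7).
    # Column names ("pre_earnings", "trainee") and the 5% threshold are
    # hypothetical; this is not the procedure used in the studies cited above.
    import pandas as pd
    import statsmodels.formula.api as smf

    def passes_pre_program_test(df: pd.DataFrame) -> bool:
        """True if trainees and comparisons show no statistically significant
        difference in pre-program earnings."""
        model = smf.ols("pre_earnings ~ trainee", data=df).fit()
        return model.pvalues["trainee"] >= 0.05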

Further support for the experimental method comes from Greenberg and Wiseman (1992), who stated that fifteen years of experimentation (e.g., the income maintenance experiments, the Supported Work Demonstration) had demonstrated that random assignment was "a methodologically superior approach to program evaluation," and provided evidence that such studies were feasible. Moreover, the work on the Omnibus Budget Reconciliation Act evaluations by the Manpower Demonstration Research Corporation (MDRC) convinced decision-makers at the U.S. Department of Health and Human Services -- which usually funds evaluations of training programs -- that random assignment should be used as the foundation of social assistance reform evaluations. MDRC's evaluations were generally considered excellent, helping to establish random assignment as the procedure of choice.

Another advantage of experimentation is that the results are understandable and convincing to policy makers (Burtless, 1995). Without the complicated qualifications associated with quasi-experiments, analysts can present straightforward experimental findings such as "the program raised annual earnings of participants by $1,000." This simplicity makes it more likely that policy makers will use the evaluation findings. "They do not become entangled in a protracted and often inconclusive scientific debate about whether the findings of a particular study are statistically valid. Politicians are more likely to act on results they find convincing" (Burtless, 1995, p.67).
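To illustrate why such findings are easy to state, the snippet below computes an impact estimate of exactly this form: the difference in mean annual earnings between randomly assigned program and control groups, together with its standard error. The earnings figures are simulated purely for illustration.

    # Simulated data; with random assignment, the impact estimate is simply
    # the difference in group means.
    import numpy as np

    rng = np.random.default_rng(0)
    program = rng.normal(16000, 5000, size=500)   # annual earnings, program group
    control = rng.normal(15000, 5000, size=500)   # annual earnings, control group

    impact = program.mean() - control.mean()
    se = np.sqrt(program.var(ddof=1) / program.size
                 + control.var(ddof=1) / control.size)
    print(f"Estimated impact: ${impact:,.0f} per year (s.e. ${se:,.0f})")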

Finally, a National Academy of Sciences panel recommended the following conditions as necessary (but not sufficient) for quality research: the use of random assignment to groups; reasonable operational stability of the program prior to final assessment; adequate sample coverage and low rates of sample attrition; outcome measures that well represent program objectives, both immediate and longer-term; and a follow-up period that allows time for program effects to emerge or decay (Gueron and Pauly, 1991).

These studies, among others, have convinced many evaluators that experimental estimators are better than quasi-experimental ones (e.g., Burtless 1995; and Friedlander and Robins 1995).

The Quasi-Experimental Camp

Others, however, have contested the view that experimental designs are superior to non-experimental designs. In the forefront of this group are James Heckman and his colleagues (e.g., Heckman, Hotz, & Dabos, 1987; Heckman and Smith, 1995). They argue that with a sufficiently rich data set and appropriate econometric modeling techniques, it is possible to arrive at reliable estimates of impact.

To mount a serious challenge to the emerging consensus that experimentation was the better route, they had to surmount the conclusion of LaLonde and his associates that no available quasi-experimental method produced estimates close to the unbiased experimental estimates. Heckman and Hotz (1989) applied some simple specification tests to models estimated on data from the NSW experiment and found that the most inaccurate models could be rejected, leaving a subset of models that produced impact estimates close to the experimental results. Heckman and Smith (1995) presented some cogent arguments that to some extent undermine LaLonde's findings: sample sizes were too small and insufficient geographical data were available to place comparison group members in the same local labour market as participants; only one year of pre-program data was available, ruling out potentially effective econometric strategies (and leaving the estimates subject to the "Ashenfelter Dip"; see footnote 8); the studies did not use a variety of standard specification tests; and there have been important advances in non-experimental methods since these studies were done. For instance, Heckman and Smith (1995b) found that labour force dynamics (i.e., movements between employment, unemployment and into and out of the labour force), as well as other contributing factors such as age, education, marital status and family income, can be used to form comparison groups that are "virtually identical" to treatment groups, thereby controlling for selection bias.
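The general idea behind forming a closely matched comparison group can be sketched as follows. This is not Heckman and Smith's estimator; it is a simple one-to-one nearest-neighbour match on standardized pre-program characteristics (for example age, education, pre-program earnings and labour force status), with all names and inputs assumed for illustration.

    # One-to-one nearest-neighbour matching on pre-program characteristics.
    # Illustrative only; not the method used in the studies cited above.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler

    def match_comparison_group(participants: np.ndarray,
                               candidates: np.ndarray) -> np.ndarray:
        """For each row of participant covariates, return the index of the
        closest non-participant candidate (after standardizing covariates)."""
        scaler = StandardScaler().fit(np.vstack([participants, candidates]))
        nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(candidates))
        _, indices = nn.kneighbors(scaler.transform(participants))
        return indices.ravel()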

Heckman and Smith (1995) also raise some serious objections to the experimental method, both theoretical and empirical:

  • Randomization may alter the pool of persons eligible for the program or change the behaviour of participants, the so-called "randomization bias." For instance, in order to form a control group, occasionally the pool of potential participants must be expanded, usually by relaxing some eligibility criteria.

  • If close substitutes for the experimental treatment are available, the assumption that the control group receives no treatment is violated to the extent that members take advantage of them, generating a "substitution bias." For instance, some training programs consist of buying seats at community colleges; non-participants can choose to attend the same college course if they pay on their own or find another source of funding.

  • Experimental data cannot answer many of the questions of central interest to policy makers, including the median impact of the program and the proportion of participants for whom the program had a positive (or negative) impact. No parameter that depends on the joint distribution of outcomes in the treatment and control groups can be estimated (see the illustration following this list). "Only if the evaluation problem is defined exclusively in terms of means can it be said that experiments provide a precise answer" (Heckman and Smith, 1995, p. 22).

  • Experimental evaluations tell policy makers whether a program works or not; they do not typically shed any light on why it worked or did not work.

  • Institutional factors make it hard to carry out random assignment at all in some cases and difficult to do an optimal job in others. Staff can subvert the process because they are opposed to random assignment. Costs can militate against making the selection at the optimal point in the decision process (see footnote 9). Multi-stage randomization, which is expensive and disruptive to the program, may be required to generate estimates of the impacts of different services.

  • Complementary non-experimental analyses are often required to overcome the shortcomings of experimental models.
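The identification point raised in the third bullet above can be stated compactly. Writing Y_1 and Y_0 for a person's outcome with and without the program, and Delta = Y_1 - Y_0 for that person's impact (notation introduced here only for illustration), randomization identifies the two marginal outcome distributions, and hence the mean impact, but not parameters that depend on how Y_1 and Y_0 vary together:

    E[\Delta] = E[Y_1 - Y_0] = E[Y_1] - E[Y_0]    % identified: depends only on the two marginal distributions
    \mathrm{med}(\Delta),\quad \Pr(\Delta > 0)    % not identified: both depend on the joint distribution of (Y_1, Y_0)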

Burtless (1995) agrees that experimental evaluations have such shortcomings, but counters that quasi-experiments share some of them (e.g., quasi-experiments can also be costly and disruptive) and are "usually plagued by more serious statistical problems than those that occur in randomized trials." Moreover, he adds that nothing inherent in the experimental design precludes researchers from also using non-experimental methods to analyze the data.

Conclusion

The jury is still out as to whether quasi-experimental designs can adequately control for selection bias, and it is safe to conclude that experimental designs are superior in this critical respect. But the many problems associated with experiments render them impractical for many, if not most, evaluations. Quasi-experimental designs are therefore often the best practical approach for evaluations of training programs.


Footnotes

7 For example, one can see if the model correctly predicts no differences in outcomes between trainees and non-trainees during the period before the program began. If the model finds statistically significant differences between groups before the program, it should be dropped since it failed the specification test.
8 This refers to the phenomenon that the earnings of training program participants tend to dip just before they enter training, because unemployment is often the impetus to take the course; thus, before and after difference estimators will tend to overestimate the effect of the program (Ashenfelter and Card, 1985).
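A small numerical illustration of the dip, with invented figures:

    # Invented figures illustrating the "Ashenfelter dip": a before/after
    # comparison that uses the dip year as its baseline overstates the gain.
    earnings_two_years_before = 15000  # "normal" pre-program earnings
    earnings_year_before = 11000       # dip caused by the job loss that prompted enrolment
    earnings_after = 16000

    before_after_estimate = earnings_after - earnings_year_before           # 5,000
    gain_over_normal_earnings = earnings_after - earnings_two_years_before  # 1,000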
9 Randomization can occur at any step of the process: participant becomes eligible, becomes aware of the program and his/her eligibility for it, applies, is accepted, is assessed by staff, is assigned to particular services, begins receiving the services, and completes the program. The optimal place for randomization depends on the evaluation questions at issue. For determining mean impact, the assignment should take place as close as possible to the commencement of training to minimize attrition.

