The Design of Summative Evaluations for the Employment Benefits and Support Measures (EBSM)

(For the Human Resources Development Canada (HRDC) Expert Panel on Summative Evaluation Design for the EBSM Program)

Prepared by:
Human Resources Development Canada (HRDC) and its provincial and territorial partners recognize the importance of evaluation evidence in helping them assess the impacts of the Employment Benefits and Support Measures and the Labour Market Development Agreements (EBSM/LMDA). Provision for evaluation of these is found in both the Employment Insurance legislation and in the LMDAs.
Generally, the evaluation of employment interventions is a complex undertaking. In the case of the summative evaluations of the EBSM/LMDA, the complexity is compounded by the wide diversity of communities and delivery models across the country, by the number of individual evaluation studies to be undertaken, and by the timing of each.
Evaluation studies involve a blend of creative thinking and in-depth knowledge of measurement techniques. This is so partly because they must combine knowledge of complicated labour market and economic theories, intricate models of program delivery and learning, elaborate sets of expected and unexpected potential impacts, and highly structured statistical models to estimate outcomes and impacts from partial lists of concepts that often cannot be measured directly.
By their very nature, EBSM/LMDA evaluations involve a large number of partners with a wide range of interests. They include policy makers, program managers, delivery officers, consultants, and citizens (such as program participants and taxpayers). These groups possess different levels of understanding of evaluation principles, standards, techniques and interpretations of the evidence. For the evaluations to succeed, it is essential to establish a high level of trust among all partners/interested parties. Therefore, it is important that the methodology be highly defensible and that it be presented in a lucid and fully transparent manner.
HRDC engaged an expert panel to develop a generic, objective, state-of-the-art framework for the evaluation of the EBSM/LMDAs. The framework was developed by the experts to provide a common and sound methodological foundation for the various evaluations. A guiding principle was that it be flexible enough to meet requirements for reports at the national, provincial, territorial and local levels. A key feature is that it is comprehensive and readily adaptable to specific operational circumstances and subject matter content. The proposed methodology is based on the latest theoretical and applied evaluation approaches, together with the experts' views on, and experience with, their application.
The Panel's proposed framework is summarized in the body of this report. It should facilitate agreement with respect to methodology so that at the time of reporting the focus will be on interpretation of the evidence in terms of likely impacts of the EBSM/LMDAs. The body of this report is devoted almost exclusively to methods of measuring outcomes, and not to what ought to be measured. The proposed measurement framework can be applied to a wide range of outcomes (for example, those discussed in Annex A).
HRDC would like to thank Mr. Walter Nicholson, Mr. Jeff Smith, Mr. Chris O'Leary, Mr. Paul Decker, and Mr. Joe Michalski for their excellent work.
HRDC offers this report in the spirit of openness to advance the understanding and facilitate discussion of evaluation methodology, and to ensure the best possible set of summative evaluations of the EBSM.
This report presents a general framework that can form the core of any well-designed evaluation of the EBSMs. It identifies questions that should be addressed, offers assessments of strengths and weaknesses of various approaches for measurement, and highlights issues toward which more research efforts might be directed.
Even though the paper goes into considerable depth on some matters, it should not be read as a definitive, detailed plan for any given evaluation. As discussed in the paper, the framework will have to be adapted (modified and extended) to reflect features that may be unique to the particular requirements of any given evaluation. For example, any final design for an evaluation must ultimately depend on the goals set for that evaluation, the nature of the programmatic interventions to be studied, the quality of the data available, and on such practical considerations as time and resource availability.
An important consideration for the evaluation of the EBSMs concerns the reporting and ultimate perceived validity of evaluation evidence. Although this paper attempts to describe the "state-of-the-art" with respect to evaluation methodology, it does not pretend to be the final word on what a "good" evaluation is, nor on how the evidence issuing from such methodology should be used. Assessing the overall validity of the evidence relies, in large measure, on: (1) an a priori understanding of the applicability and limitations of the general approach being used (e.g., limitations of quasi-experimental designs); and (2) an ex post appraisal of the quality of the data collected and of the robustness of the estimation strategies used. This is especially important in the context of the EBSM evaluations, where it is likely that a major component of most evaluations will be to provide evidence to assist in the assessment of the "incremental effects" of the interventions being offered. While well-understood and organized data can contribute a great deal, the problems associated with deriving reliable estimates of such incremental effects in non-experimental contexts are pervasive. In many cases it may simply prove to be impossible to provide valid results based solely on the available data. Hence, planned evaluations must include procedures for the assessment of strengths and weaknesses, together with specification of a clear process for the integration of results and their dissemination to wider audiences.
These cautionary notes are an integral part of the rationale for the framework articulated in this paper and are reinforced throughout. A proper understanding of the Panel's suggestions requires that they be viewed within this cautionary context. Recognizing the potential strengths and weaknesses associated with various approaches in the development of the summative evaluations will be instrumental in the proper interpretation and presentation of the evidence gathered and reported in future.
The paper is divided into five analytical sections that focus on the following general conceptual points:
A final section (F) summarizes some of the issues raised here that should be kept in mind when designing quasi-experimental evaluations and includes some guidelines regarding implementation of the EBSM evaluations.
In order to structure a clear analysis of the likely effect of the interventions that constitute the EBSM program, one needs to develop a precise definition of what that program is and how "participants" in it are to be identified. Two general factors guided the Panel in its considerations of these issues: (1) The program and its constituent interventions should be defined in ways that best reflect how services are actually delivered; and (2) Participation should be defined in a way that both reflects individuals' experiences and does not prejudge outcomes. Given these considerations, the Panel made the following observations and suggestions:
1. A specification under which program participants are defined based on Action Plan start dates seems to best meet evaluation goals. The Panel believed that the Action Plan concept best reflects the overall "case management" approach embodied in the EBSM program. Although participants actually experience quite different specific interventions, most share a common process of entry into the program together with a conscious direction to services that are deemed to be most beneficial. Hence, the Action Plan notion (although it may not be explicitly used in all regions) seemed to correspond most directly with what policy-makers believe the "program" to be.
The use of start dates as a way of identifying program participants appeared to the Panel to offer a number of advantages. First, it seems likely that start dates are defined in a relatively unambiguous way for all program participants. The alternative of focusing on participants who are ending the program during a specific period seemed more problematic, in part because Action Plan end dates are often arbitrarily defined. A second reason for preferring a start date definition of participation is that it seems likely that such a definition would match up better with other published data, such as those in the Monitoring and Assessment Reports. Finally, opting for a start date definition of participation seemed to provide a conceptually clearer break between pre-program activities and "outcomes" because, for some purposes, everything that occurs after the start date can be regarded as "outcomes". That is, the in-program period is naturally included so that opportunity costs associated with program participation can be included as part of an overall assessment.
Of course, the Panel recognized that some regions do not use a case-management/Action Plan approach to delivering EBSM interventions. They also recognized that basing a definition of participation on start dates poses conceptual problems in cases of multiple interventions or when there is policy interest in differentiating between in-program and post-program outcomes. Although the Panel does not believe that these potential disadvantages negate the value of their suggested approach (which seems flexible enough to accommodate a wide spectrum of actual programmatic procedures), it did believe that designers of specific evaluations should explore these issues again in the light of actual program processes and data availability.
2. In general, evaluations should focus on individuals who have participated in an "Employment Benefit". The four employment benefits ("EBs" — that is, Targeted Wage Subsidies, Self Employment, Job Creation Partnerships, and Skills Development) constitute the core of EBSM offerings. They are also the most costly of the interventions offered on a per participant basis. Therefore, the Panel believed that the evaluations should focus on participants in these interventions. It seems likely that regions will wish to assess the impacts of each of these interventions separately. That desire poses special problems for the design of the evaluations. These are discussed in detail in Section C below. There we show that optimal designs for achieving the goal of assessing individual interventions may differ depending on the specific interests of regional policy-makers. The Panel also noted that some consideration might be given to including a separate group of participants in Support Measures only in some evaluations. That possibility is discussed below.
3. "Participation" requires a precise definition — perhaps defined by funding (if feasible). The Panel expressed concern that a number of EBSM clients may have start dates for specific Employment Benefits (or for Action Plans generally) but spend no actual time in any of the programs. Assuming that such "no shows" are relatively common (though some data should be collected on the issue), the panel believed that there should be some minimal measure of actual participation in an intervention. One possibility would be to base the participation definition on observed funding for an individual in the named intervention. Whether individual-specific funding data for each of the EB interventions are available is a question that requires further research. It seems likely that funding-based definitions of participation would differ among the regions because ways in which such data are collected would also vary.
4. Program completion is not a good criterion for membership in the participant sample. Although it might be argued that it is "only fair" to evaluate EBSM based on program completers, the Panel believed that such an approach would be inappropriate. Completion of an EB is an outcome of interest in its own right. It should not be a requirement for membership in the participant sample.
5. A separate cell for Employment Assistance Services (EAS)-only clients should be considered in some regions. The EBSM program allocates roughly one-third of its budget to Support Measures and these should be evaluated in their own right. Prior research has shown that relatively modest employment interventions can have the most cost-effective impacts (see Section C). Hence, the Panel believed that dropping participants with only an SM intervention from the analysis runs the danger of missing quite a bit of the value of the overall EBSM program. That is especially true in regions that place a great deal of emphasis on Support Measures. It also seems possible that including an EAS-only sample cell might aid in estimating the impact of the EB interventions themselves. This is so because an EAS-only sample may, in some circumstances, prove to be a good comparison group for specific EB interventions. Many other evaluations have adopted a tiered approach to defining treatments in which more complex treatments represent "add-ons" to simpler ones. In some locations the EBSM may in fact operate in this tiered way. A few further thoughts on potential roles for an EAS-only treatment cell are discussed in Sections B and C; the Panel believes that most evaluations should consider this possibility.
6. Apprentices should be the topic of a separate study. Although the provision of services to apprentices is an important component of the EBSM program, the Panel believed that the methodological issues that must be addressed to study this group adequately would require a separate research agenda. Some of the major issues that would necessarily arise in such a study would likely include: (1) How should apprentices' spells of program participation be defined? (2) How is a comparison group to be selected for apprentices — is it possible to identify individuals with a similar degree of "job attachment"? And (3) How should outcomes for apprentices be defined? Because the potential answers to all of these questions do not fit neatly into topics that have been studied in the more general employment and training literature, the Panel believed that simply adding an apprenticeship "treatment" into the overall summative evaluation design would yield little in the way of valuable information. Such a move would also detract from other evaluation goals by absorbing valuable study resources.
The Panel recognized that assessing the incremental effects of Employment Benefits and Support Measures (EBSM) interventions is an appropriate goal. Doing so in a non-experimental setting (i.e., using quasi-experimental designs), however, poses major challenges. The methodological issues associated with deriving such "incremental effects" in the form of definitive estimates of the effect of an intervention on participant outcomes using comparison groups are profound. In many cases, based on available data, it may simply be impossible to provide definitive evidence regarding "true" impacts. In such circumstances, results should be cast in the form of evidence to be used as "indicators of potential impacts", under certain clearly specified conditions. Any extension of the results beyond those conditions should be subjected to careful scrutiny and critical analysis.
The issues may be illustrated as follows: a goal of a summative evaluation is to obtain "consistent" and relatively "precise" estimates of the outcomes for participants and comparison groups. That is, the methodology should yield estimates that, if samples were very large, would converge to the true (population) values. And the actual estimates should have a small sampling variability so that the estimates made have only small probabilities of being far from the true values. Indeed, managers often expect that the evaluation can provide consistent and precise estimates of the "incremental effect" (i.e., the effect of the program on outcomes relative to what would have happened in the absence of the intervention). The Panel recognized the difficulty of achieving these goals using a quasi-experimental design. Inherent in such a design is the basically unanswerable question of how similar comparison groups really are, especially when differences between groups may be embedded in characteristics not directly observable in the evaluation. The Panel believed that it is not possible to specify on a priori grounds one "best" approach that will optimally provide proper estimates in all circumstances. Hence, the Panel recommended that evaluators take a broad-based and varied approach, exploring a variety of methodologies both in the planning phase and in the analysis phase of their work. It also strongly believed that the approaches taken should be carefully documented and critically compared and that evaluators should be encouraged to suggest a best approach.
The Panel believed that a number of approaches to measuring the outcomes of participants and non-participants in EBSM interventions would be feasible for the summative evaluations.
To ensure that all options were considered, it developed a rather exhaustive list of possibilities. What follows is a listing of those possibilities, with some critical comments on each.
Random Assignment ("Experimental") Methods: Random assignment remains the "gold standard" in labour market evaluations. The procedure guarantees consistent and efficient estimation of the average effect of a treatment on the treated5 and has become the standard of comparison for all other approaches (see, for example, Lalonde 1986 and Smith and Todd 2001). Because of these advantages, the Panel strongly believed that the possibility for using random assignment in some form in the summative evaluations should be considered. Of course, the Panel recognized that the primary objection to random assignment is that it conflicts with the fundamental universal access goal of all EBSM programs. To the extent that individuals selected for a control group would be barred from EBSM program participation (at least for a time), this denial of service would create an irreconcilable difference between evaluation needs and program philosophy. The usual solution to such conflicts is to design treatments in such a way that they represent an enhancement over what is normally available in the belief that denial of enhancements is not so objectionable on philosophical grounds. If funding for specific types of interventions is limited, such interventions themselves might be considered "enhancements" so random assignment in such situations may also be supportable. Other constraints (say, on program capacity) may also provide some basis for structuring random assignment evaluations in ways that do not conflict with universal program eligibility.
The Panel generally concluded, based on evidence from experiences with the EBSM program in the field, that such options are not common within the program as currently constituted, however. Still, the Panel felt that evaluators should always investigate the possibilities for structuring a random assignment evaluation first before proceeding to second-best solutions. Evaluators should also report on where random assignment evaluations might be most helpful in clarifying ambiguous evidence upon which claims about impacts may be based. This might serve to highlight possible ways in which random assignment might be most effectively used to supplement the other research initiatives directed toward the EBSM program.
Random assignment should also be considered for purposes other than the forming of traditional treatment-control designs. For example, a variety of increasingly extensive program enhancements could be randomly assigned on top of a universal service option. Or random assignment could be used to create "instruments" that help to identify program participation in otherwise non-experimental evaluations. This could be accomplished by, say, imposing additional counselling requirements on a random group of EI recipients, which might generate random variation in training receipt. A similar goal could be achieved through the use of randomly assigned cash bonuses. The main point is that random assignment could be used in a variety of innovative ways to improve the quality of the EBSM evaluations. Because of the additional validity that can often be obtained through random assignment, the Panel would enthusiastically support such innovative efforts.
Non-Experimental Methods: A wide variety of non-experimental methodologies have been used for evaluating labour market programs when random assignment is infeasible. Most of these utilize some form of comparison group in the hope that experiences of comparison group members faithfully replicate what would have happened to program participants had they not been in the program. We begin with a relatively simple listing of the possibilities. This is followed, in the next subsection, with a more detailed discussion of how comparison group methods interact with measurement methods in determining whether consistent estimates of outcomes can be achieved. Because, as we show, differences in performance of the methods depend in part on whether participants and comparison group members differ along dimensions that are observed or unobserved, we use this distinction to illustrate the approaches.
Matching Methods. A variety of matching strategies might be employed in the EBSM summative evaluations. As we discuss in the next section, these could be based only on administrative data or on some combination of administrative and survey data. While matching does not directly control for unmeasured differences between participants and comparison group members, it may approximately do so if the unobservable factors are correlated with the variables used for the matching. Adoption of matching procedures would, as we show in Section C, have consequences for the sample allocation in the evaluations — primarily by increasing the size of comparison groups to allow for the possibility that some comparison group members included in the initial sample would not provide a good match to any participant. Adoption of matching strategies would also pose logistical problems in the collection of survey data for the evaluation, and these also are discussed in Section C.
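To make the mechanics concrete, the following is a minimal sketch of one widely used variant, nearest-neighbour matching on an estimated propensity score (Rosenbaum and Rubin 1983). The data frame and variable names are hypothetical, and the synthetic data carry no real selection process; an actual implementation would also need to handle common support, ties, and the choice of matching with or without replacement.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Hypothetical analysis file: one row per person, with a participation
# indicator, pre-program covariates, and a post-program outcome.
df = pd.DataFrame({
    "participant": rng.binomial(1, 0.4, n),
    "age": rng.normal(35, 10, n),
    "prior_earnings": rng.normal(20000, 8000, n),
    "ei_weeks_prior": rng.poisson(12, n),
    "earnings_post": rng.normal(22000, 9000, n),
})
covariates = ["age", "prior_earnings", "ei_weeks_prior"]

# Step 1: estimate the propensity score, i.e. the probability of
# participation conditional on the observed covariates.
ps = LogisticRegression(max_iter=1000).fit(df[covariates], df["participant"])
df["pscore"] = ps.predict_proba(df[covariates])[:, 1]

# Step 2: match each participant to the comparison-group member with the
# closest propensity score (one-to-one nearest neighbour, with replacement).
part = df[df["participant"] == 1]
comp = df[df["participant"] == 0].reset_index(drop=True)
nearest = np.abs(part["pscore"].values[:, None]
                 - comp["pscore"].values[None, :]).argmin(axis=1)

# Step 3: the matched difference in mean outcomes estimates the average
# effect of treatment on the treated, provided all relevant differences
# are captured by the matching variables.
att = part["earnings_post"].mean() - comp.loc[nearest, "earnings_post"].mean()
print(f"Matched ATT estimate: {att:,.0f}")
```

In practice the matching would typically be run separately for each intervention, and unmatched comparison cases would be discarded; this is one reason, developed in Section C, for fielding a comparison sample substantially larger than the participant sample.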
Two general approaches to matching have been employed in the literature:
Methods that Control for Unobservable Differences
These methods also proceed from assumptions about the nature of unobservable differences between groups of participants and non-participants. Then, on the basis of these assumptions, the methods attempt to account for and eliminate these a priori differences through subtraction, complex modeling, or careful selection of comparison groups on the basis of detailed characteristics. The validity of the estimates depends on the quality and availability of the data, the extent to which the assumptions about unobservable factors hold, and the robustness of the techniques. It is important, therefore, that evaluators, in their assessments of the impacts of the EBSM, take into account deviations from the underlying assumptions and their potential effects on the corresponding estimates.
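As an illustration of the "subtraction" idea, the sketch below implements a simple difference-in-differences contrast on simulated pre- and post-program earnings. The data-generating process is invented for the illustration: a time-invariant unobservable raises both earnings and the probability of participation, which is precisely the situation in which this estimator, but not a simple difference in means, remains consistent (see Table 1 below).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
true_impact = 800.0

# Time-invariant unobservable ("motivation") that raises earnings in every
# period and also makes participation more likely.
motivation = rng.normal(0, 2000, n)
participant = (motivation + rng.normal(0, 2000, n)) > 0

pre_earnings = 20000 + motivation + rng.normal(0, 3000, n)
post_earnings = (21000 + motivation + rng.normal(0, 3000, n)
                 + true_impact * participant)

# A simple post-program difference in means is biased upward: participants
# already earned more before the program.
naive = post_earnings[participant].mean() - post_earnings[~participant].mean()

# Difference-in-differences subtracts each group's pre-program mean,
# eliminating any time-invariant unobservable.
did = ((post_earnings[participant] - pre_earnings[participant]).mean()
       - (post_earnings[~participant] - pre_earnings[~participant]).mean())

print(f"Naive difference in means:  {naive:7.0f}")  # far above 800
print(f"Difference-in-differences:  {did:7.0f}")    # close to 800
```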
Specific approaches that attempt to control for unobservable factors include:
The techniques described in the previous subsection have often been used in various ways in evaluations. In some cases, one procedure is adopted on an a priori basis and that method is the only one used. In other cases researchers have adopted an eclectic approach in which they may do some matching, a bit of regression analysis, and, for good measure, throw in some IV techniques. In general, the Panel feels that neither of these approaches is ideal. The selection of a single analytical approach to an evaluation may often prove inadequate once the data are examined in detail. But the adoption of a "kitchen sink" approach may lead to the adoption of techniques that are incompatible with each other. What seems required instead is some detailed rationale for any of the approaches chosen in a particular evaluation accompanied by an ex post appraisal of the likely validity of the approaches taken.
To aid in appraising issues related to the selection of analytical approaches to the evaluations, Table 1 explores the interaction between comparison group choice and estimation methods in some detail. The table considers three specific comparison group possibilities according to how these members are to be matched to participant group members:
Each of these comparison group methods is related to four potential approaches that might be used to generate actual estimates of the differences in outcomes of interest between participant and non-participant groups:
Table 1. Consistency of Estimators if Participant/Comparison Groups Differ in:

| Comparison Group | V1 only | V1 or V2 | V1, V2, or V3 | V1, V2, V3, or U |

Simple Difference of Means
| V1 Match | Yes | No | No | No |
| V1, V2 Match | Yes | Yes | No | No |
| V1-V3 Match | Yes | Yes | Yes | No |

Regression Adjusted Differences in Means***
| V1 Match | Yes | Yes* | Yes* | No |
| V1, V2 Match | Yes | Yes | Yes* | No |
| V1-V3 Match | Yes | Yes | Yes | No |

Difference-in-Difference Estimates
| V1 Match | Yes | Yes* | Yes* | Yes if U time-invariant; No otherwise |
| V1, V2 Match | Yes | Yes | Yes* | Yes if U time-invariant; No otherwise |
| V1-V3 Match | Yes | Yes | Yes | Yes if U time-invariant; No otherwise |

IV Estimators
| V1 Match | Yes** | Yes** | Yes** | Yes if instrument is good and IV procedure not compromised by matching |
| V1, V2 Match | Yes** | Yes** | Yes** | Yes if instrument is good and IV procedure not compromised by matching |
| V1-V3 Match | Yes** | Yes** | Yes** | Yes if instrument is good and IV procedure not compromised by matching |

* OLS adjustment will yield consistent estimates in these cases, though these will be less efficient than estimates based on matching using all measurable variables. Note also that matching on all available variables would generally yield consistent estimates, whereas OLS adjustment might not because of the linearity assumption implicit in OLS.
** Adoption of extensive matching algorithms may affect the efficiency of IV estimation methods.
*** It is assumed that OLS regressions would include all three types of measurable variables, whereas matching would be only on the basis of the sets of variables identified in the table.
In order to understand this table, consider two examples. First, suppose that participants and comparison group members differ only along dimensions that are easily measured in the complete administrative data (that is, they differ only in V1 and V2 measures). In this case the third column of the table — labeled "V1 or V2" — is the relevant situation, and the entries in Table 1 show that virtually all of the estimation procedures would work quite well no matter how the samples are matched. In this case a first-best solution would be to match on both V1 and V2, although regression-adjusted means with relatively modest (V1) matching might do equally well. Alternatively, in the more realistic case where participant and comparison group members differ along unmeasured dimensions (labeled "U"), the estimation procedures are quite varied in their performance. One approach that has been suggested at times, for example, is to use matching from administrative data together with IV techniques to control for unobservable variables. That choice is reflected in the lower right-hand corner of Table 1. The information in the table makes two important points about this analytical choice:
Of course, these statements apply to just one potential analytical choice. Any others that might be suggested should be subjected to a similar analysis. It seems likely that the outcome from such an extended assessment of a variety of approaches would be that "it all depends on the nature of the unobservable variables". For this reason, the Panel believed that, to the extent feasible, a number of different analytical strategies should be explored in detail during the design phases of each evaluation and that several of the most promising approaches should be pursued in the actual analysis.
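To see what the instrumental variables entries in Table 1 involve, consider the following minimal simulation of a simple (Wald) IV estimator. The binary instrument here is hypothetical, in the spirit of the randomly assigned counselling requirements mentioned earlier, and all data are invented; the point is only that a valid instrument can recover the impact even when selection operates on an unobservable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
true_impact = 1000.0

# Unobservable that drives both participation and earnings (the "U" column).
u = rng.normal(0, 1, n)

# Hypothetical binary instrument: shifts participation but has no direct
# effect on earnings.
z = rng.binomial(1, 0.5, n)
d = ((0.8 * z + u + rng.normal(0, 1, n)) > 0.5).astype(float)  # participation
y = 20000 + 3000 * u + true_impact * d + rng.normal(0, 2000, n)  # earnings

# A naive regression of earnings on participation is inconsistent because
# U is omitted and correlated with participation.
naive = np.polyfit(d, y, 1)[0]

# Simple IV (Wald) estimator: cov(z, y) / cov(z, d). Consistent if the
# instrument is valid, whatever the unobservables.
iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

print(f"Naive OLS slope: {naive:7.0f}")  # biased upward by U
print(f"IV estimate:     {iv:7.0f}")     # close to 1000
```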
Not only is it desirable that evaluation estimates of program impacts should be consistent, but, to be useful, they should also be relatively precise. In other words, the range of true impacts that is consistent with the estimates derived from an evaluation should be relatively narrow. If not, this range may be so broad that it is not possible to conclude anything about the effectiveness of a program — indeed an impact of zero (or even a negative impact) may be plausibly inferred from the data. Four general points should be made about achieving precision in the evaluations. First, achieving consistency is a goal that is logically prior to considerations of precision. If an evaluation offers no hope of providing consistent estimates, it would do little good to be assured that the (inaccurate) estimates that are provided are measured precisely. Second, achieving precision often becomes a matter of willingness to devote resources to an evaluation. In a sense, one can always "buy" more precision by simply increasing sample sizes, so there will always be a tradeoff between precision and the willingness to pay for the information that an evaluation is expected to yield. A third point concerns the relationship between precision and the choice of estimators. Because some of the estimators recorded in Table 1 will require that estimates be based only on sub-samples of the data, this will involve an inescapable loss of precision. A more general related point is that for many of the estimators discussed in Table 1 it is not possible to make clear statements about precision. Indeed, in some cases, the uncertainty associated with an estimate cannot be calculated analytically and must be approximated using "bootstrap" simulation methods.
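The bootstrap idea is simple to state in code. The sketch below uses an invented data array and a deliberately simple difference-in-means estimator; in an actual evaluation the resampled function would be the full estimation procedure, matching step included, so that the uncertainty introduced by matching itself is reflected in the reported interval.

```python
import numpy as np

def bootstrap_se(estimator, data, n_reps=1000, seed=0):
    """Approximate the standard error of `estimator` by re-estimating it
    on samples of rows drawn from `data` with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [estimator(data[rng.integers(0, n, n)]) for _ in range(n_reps)]
    return np.std(reps, ddof=1)

# Invented illustration: column 0 is a participation flag, column 1 an
# earnings outcome.
rng = np.random.default_rng(1)
data = np.column_stack([
    rng.binomial(1, 0.5, 3000),
    rng.normal(21000, 11000, 3000),
])

def diff_in_means(d):
    return d[d[:, 0] == 1, 1].mean() - d[d[:, 0] == 0, 1].mean()

print(f"Bootstrap standard error: {bootstrap_se(diff_in_means, data):,.0f}")
```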
Given the complexities inherent in making statements about the precision of non-experimental estimates, it has become common to use illustrative precision estimates based on experimental designs in the planning of evaluations. Implicitly these illustrations assume that the experimental calculations represent the best precision that might be expected in an evaluation, and plans are based on this best-case scenario. Table 2 provides an example of this procedure. The calculations in the table show the size of impact that could be detected with 80 percent power, assuming that a 95 percent one-tail test of significance is used in the evaluations. Standard deviations used in these calculations are drawn from the Nova Scotia feasibility study for the Medium Term Indicators (MTI) project. These may understate the actual precision obtainable in the summative evaluations, because those evaluations may have access to better control variables and statistical methodologies than were used in the Nova Scotia study. However, as we show in Section C, the precision estimates in the table are roughly consistent with what has been found in many actual experimental evaluations of active labour market programs.
Table 2. Impacts Detectable with 80 Percent Power (95 percent one-tail test; assumed standard deviations in parentheses)

| Participant Sample | Comparison Sample | T4 Earnings (11,000) | Employed (0.370) | EI Weeks (9.00) |
| 2,000 | 1,000 | 1,103 | 0.037 | 0.90 |
| 2,000 | 2,000 | 901 | 0.030 | 0.74 |
| 3,000 | 1,500 | 901 | 0.030 | 0.74 |
| 4,000 | 2,000 | 780 | 0.026 | 0.64 |
| 1,000 | 1,000 | 1,274 | 0.043 | 1.04 |
| 1,000 | 400 | 1,685 | 0.057 | 1.38 |
| 500 | 400 | 1,911 | 0.064 | 1.56 |
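The detectable impacts in Table 2 can be approximated with the standard minimum detectable effect formula under a normal approximation, as sketched below. The results track the table closely without reproducing it exactly, so the original calculations presumably incorporated additional adjustments; the function and its parameters are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(sd, n_part, n_comp, power=0.80, alpha=0.05):
    """Smallest true impact detectable with the given power, using a
    one-tailed test at significance level alpha."""
    z = NormalDist().inv_cdf
    se = sd * sqrt(1.0 / n_part + 1.0 / n_comp)  # SE of a difference in means
    return (z(1 - alpha) + z(power)) * se

# T4 earnings, assumed SD = 11,000, for three of the Table 2 designs.
for n_p, n_c in [(2000, 1000), (2000, 2000), (4000, 2000)]:
    mde = minimum_detectable_effect(11000, n_p, n_c)
    print(f"{n_p} vs {n_c}: {mde:,.0f}")  # ~1,059 / ~865 / ~749, versus
                                          # 1,103 / 901 / 780 in Table 2
```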
Assuming that these figures are reasonably reflective of what can be expected from the evaluations, several observations might be made:
Three major questions must be faced in designing the actual samples to be used in the summative evaluations: (1) How is the participant group to be selected? (2) How are the administrative and survey data to be used in combination to select the comparison group(s)? and (3) How should resources be allocated to the survey samples?
The Panel did not believe that there were major conceptual issues involved in selection of a participant sample for the evaluations (assuming that administrative data were of sufficiently high quality). Participants would be selected from administrative data in a way that best represents program activity during some period. The Panel did make three minor recommendations about this selection process:
Selection of comparison groups is one of the most crucial elements of the evaluation design effort. In order to simplify the discussion of this issue we assume that all analysis will be conducted using regression adjusted mean estimates of the impact of Employment Benefits and Support Measures (EBSM) on participants. That is, for the moment we disregard some of the estimation issues raised in Section B in order to focus explicitly on comparison group selection issues. Examining whether our conclusions here would be changed if different estimation strategies (such as difference-in-difference or IV procedures) were used is a topic requiring further research, as we discuss in Section F.
Three data sources are potentially available for comparison group selection:
(1) Employment Insurance (EI)-related administrative data; (2) Canada Customs and Revenue Agency (CCRA) administrative data on earnings and other tax-related information; and (3) survey data. It is important to understand the tradeoffs involved in using these various data sources and how those tradeoffs might influence other aspects of the analysis.
Survey Data: A third potential source of data for the development of a comparison sample is the follow-up survey that would be administered about two years after entry into the EBSM program. The advantage of this data source is that it provides the opportunity to collect consistent and very detailed data from both participants and potential comparison group members about pre-program labour force activities and other information related to possible entry into EBSM interventions. These data can be used in a variety of methodologies (including both characteristic and propensity score matching and a variety of IV procedures) that seek to control for participant/comparison differences.
The potential advantages of using the survey data to structure analytical methodologies in the evaluations should not obscure the shortcomings of these data, however. These include:
This discussion suggests two approaches to the comparison group specification problem:
The complexities and uncertainties involved in the design of the summative evaluations make it difficult to make definitive statements about desirable sample sizes. Still, the Panel believed that some conclusions about this issue could be drawn from other evaluations — especially those using random assignment. Because, in principle, randomly assigned samples pose no special analytical problems, they can be viewed as "base cases" against which non-experimental designs can be judged. Under ideal conditions where all the assumptions hold, a non-experimental design would be equivalent to a random assignment experiment once the appropriate analytical methodologies (such as pair-wise matching or IV techniques) have been applied. Hence, existing random assignment experiments provide an attractive model for the evaluations.
Table 3 records the sample sizes used for the analysis of a few of the leading random assignment evaluations in the United States:
Evaluation | Experimental Sample Size | Control Sample Size | Number of Treatments |
National JTPA | 13,000 | 7,000 | 3 |
Illinois UI Bonus | 4,186 | 3,963 | 1 |
NJ UI Bonus | 7,200 | 2,400 | 3 |
PA UI Bonus | 10,700 | 3,400 | 6 |
WA UI Bonus | 12,000 | 3,000 | 6 |
WA Job Search | 7,200 | 2,300 | 3 |
SC Claimant | 4,500 | 1,500 | 3 |
Supported Work | 3,200 | 3,400 | 1 |
S-D Income Maint. | 2,400 | 1,700 | 7 |
National H.I. | 2,600 | 1,900 | 3 |
Several patterns are apparent in this summary table:
Because many of these evaluations were seeking to measure outcomes quite similar to those expected to be measured in the EBSM evaluations, these sample sizes would appear to have some relevance to the EBSM case. Specifically, these experiences would seem to suggest effective comparison sample sizes of at least 2,000 individuals. The case for such a relatively large comparison sample is buttressed by consideration of the nature of the treatments to be examined in the EBSM evaluation. Because the five major interventions offered under regional EBSM programs are quite different from each other, it will not be possible to obtain the efficiencies that arise from the tiered designs characteristic of the UI experiments. Heterogeneity in the characteristics of participants in the individual EBSM interventions poses an added reason for a large comparison group. In evaluating individual interventions it is likely that only a small portion of the overall comparison group can be used for each intervention. Again, the case for a large comparison sample is a strong one.
Interest in evaluating the specific EB interventions raises additional complications for sample allocation. In most regions SD interventions are by far the most common. A simple random sample of interventions would therefore run the danger of providing very small sample sizes for the other interventions. Hence, it seems likely that most evaluations will stratify their participant samples by intervention and over-weight the smaller interventions (a simple illustration follows this paragraph). If an evaluation adopts this structure, deciding on a comparison structure is a complex issue. Should the comparison sample also be stratified so that each sub-group mirrors individuals in the intervention strata? Or should the comparison sample seek to match the characteristics of a random sample of participants? The latter approach might be preferable if IV procedures were intended to play a major role in the analysis because such a broader comparison group might aid in predicting assignment to treatments. On the other hand, the use of intervention-specific comparison groups might be preferable for use with simpler estimation procedures. Evaluators should be expected to address such matters carefully in their designs.
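As a concrete illustration of this stratification tradeoff, the sketch below compares a proportional allocation of participant interviews with a simple square-root compromise for a hypothetical regional mix of the four EB interventions. The shares and interview budget are invented, and square-root allocation is offered only as one common compromise, not as a Panel recommendation.

```python
from math import sqrt

# Hypothetical shares of participants by Employment Benefit in one region;
# Skills Development (SD) dominates, as the text notes is typical.
shares = {"SD": 0.70, "TWS": 0.12, "SE": 0.10, "JCP": 0.08}
total_n = 4000  # total participant interviews budgeted (invented)

# Proportional allocation mirrors the program mix but leaves the smaller
# interventions with few cases.
proportional = {k: int(round(total_n * s)) for k, s in shares.items()}

# Square-root allocation over-weights the smaller interventions while
# keeping the largest one dominant.
roots = {k: sqrt(s) for k, s in shares.items()}
scale = total_n / sum(roots.values())
sqrt_alloc = {k: int(round(r * scale)) for k, r in roots.items()}

print("Proportional:", proportional)  # JCP receives only ~320 cases
print("Square root: ", sqrt_alloc)    # JCP rises to ~635 cases
```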
Implementation of any sort of matching strategy in an evaluation that relies on surveys poses special problems. The general goal is to avoid "wasting" interviews of comparison group members that will never be used in the subsequent analysis. This usually requires that some significant portion of interviews of the participant sample be completed before the comparison interviews are fielded. In that way, the characteristics of the comparison sample can be adjusted based on the experiences encountered with the participant sample. An even more complex surveying strategy might be necessary if only minimal administrative data are available for the sample selection process. In this case, it may be necessary to include some form of screening questions in the surveys of comparison cases so that surveys of non-comparable cases can be cut short before large amounts of time have been expended. This procedure has been suggested, for example, for dealing with the problem of finding comparison cases for participants in the reachback group. Because it is possible to build in biases through such survey-based sample selection procedures, they should be adopted only with extreme caution.
Four criteria guided the Panel's thoughts on outcome specification and measurement: (1) It seems likely that for most evaluations some of the key measures will focus on employment in the post-program period; (2) Given this focus, it seems clear that sufficient employment information should be collected so that a variety of specific measures can be calculated. This will aid in the tailoring of outcome measures to provide an accurate reflection of specific program purposes; (3) Data on a number of other key socio-economic variables should be collected — primarily for use as control variables in studying employment outcomes; and (4) Survey data collection techniques should strive for direct comparability across different regional evaluations.
Seven specific actions that might serve to meet these criteria are:
Cost-benefit and cost-effectiveness analyses should be considered, but they are likely to play a secondary role in the evaluations. Estimates of impacts derived in the evaluations could play important roles in providing the bases for cost-benefit and cost-effectiveness analyses. Development of a relatively simple cost-effectiveness analysis would be straightforward, assuming data on incremental intervention costs are available. The utility of such an analysis depends importantly on the ability to estimate impacts of specific interventions accurately — both from the perspective of quasi-experimental designs and the adequacy of sample sizes to provide sufficiently detailed estimates. Still, it may be possible to make some rough cross-intervention comparisons.
Conducting extensive cost-benefit analyses under the evaluations might present more significant difficulties, however, especially given the sizes of budgets anticipated. Some of the primary obstacles to conducting a comprehensive cost-benefit analysis include the possibility that many of the social benefits of the EBSM program may be difficult to measure, that estimating the long-run impacts of the programs may be difficult, and that the overall size of the program may make displacement effects significant in some locations. Methodologies for addressing this latter issue in any simple way are especially problematic. For all of these reasons, the Panel believed that the planned modest budgets of the various evaluations would not support the kind of research effort that would be required to mount a viable cost-benefit analysis. However, the Panel strongly believed that some sort of cost-benefit analysis should be the ultimate goal of the evaluations, because stakeholders will want to know whether the programs are "worth" what they cost. Hence, it believed that it may be prudent for HRDC to consider how a broader cost-benefit approach might be conducted by using the combined data from several of the evaluations taken together as part of a separate research effort.
The Labour Market Development Agreements (LMDA) evaluations will play an important role in the process leading to the development of medium term indicators for the Employment Benefits and Support Measures (EBSM) program (the "MTI project"). Because these indicators will be constructed mainly from administrative data, the evaluations offer the opportunity to appraise the potential shortcomings associated with indicators that are based on relatively limited data only. Such a comparison can help to clarify how impacts measured in the MTI project may fail to capture all of the effects (both positive and negative) of Employment Benefit (EB) interventions. The availability of a richer data set in the evaluations may also suggest simple ways in which planned medium term indicators might be improved (say by combining data from several administrative sources or by using innovative ways of aggregating over time). Alternatively, the MTI project will, by virtue of its much larger sample sizes and on-going operations, permit the study of effects on sub-groups and recent program innovations that cannot be addressed in the LMDA evaluations. MTI-generated impacts from such investigations will be more meaningful if the evaluations have established baseline validity measures that show how such outcomes relate to the more detailed impact estimates from the LMDA evaluations. For all of these reasons, coordination between the two projects is essential. In order to achieve that coordination, the Panel suggested the following general guidelines:
This paper presents a state-of-the-art generic framework for the evaluations of the Employment Benefits and Support Measures (EBSMs). In applying the methodologies outlined in this paper, those implementing the evaluations should take into consideration conditions unique to their particular jurisdictions. In the process, any assumptions made in applying the framework, as well as any modifications and extensions to it, should be defensible. They should be well documented and rationalized on the basis of the evidence, the justification for the application of the techniques, and the underlying knowledge and theory. The credibility of results can benefit from the simultaneous use of several techniques to obtain multiple lines of evidence. Moreover, to reflect some of the uncertainties inherent in quasi-experimental designs, the implementers should be encouraged to provide ranges of estimates, as opposed to single point estimates.
This paper has provided a very general treatment of some of the major issues that can be expected to arise as the designs of the EBSM evaluations proceed. The intent here was not to provide an explicit blueprint for each such evaluation, but rather to describe some of the universal, methodological questions that will have to be addressed. The Panel believes that the EBSM evaluations offer the possibility of providing cutting-edge evidence in evaluation research while, at the same time, providing substantial information that will be both valid and useful to policy-makers.
The Panel identified a number of areas that require further attention. It therefore falls to the implementers to fill in the details.
Ways should be considered in which the results from the evaluations can be pooled in subsequent analyses. The Panel believes that the pooling of results from across the evaluations, although it may pose problems of statistical non-comparability, should be an essential part of the EBSM evaluation process and contractors should be encouraged to make their data available for such purposes. Such pooling would have a number of advantages including:
To achieve these important goals, it will be necessary to impose some degree of comparability on data collection and accessibility across the evaluations. Such standards must be built in from the start — it may prove impossible to pool the data across evaluations if each is permitted to have complete freedom in defining unique approaches to data issues.
Ashenfelter, Orley. "Estimating the Effects of Training Programs on Earnings" Review of Economics and Statistics January, 1978, pp. 47-57.
Ashenfelter, Orley. "Estimating the Effect of Training Programs on Earnings with Longitudinal Data" in F. Bloch (ed.) Evaluating Manpower Training Programs. Greenwich (CT), JAI Press, 1979, pp. 97-117.
Ashenfelter, Orley and Card, David. "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs." Review of Economics and Statistics, June 1985, pp. 648-660.
Dehejia, Rajeev and Wahba, Sadek. "Propensity Score Matching Methods for Non-experimental Causal Studies." National Bureau of Economic Research (Cambridge, MA) Working Paper No. 6829, 1998.
Dehejia, Rajeev and Wahba, Sadek. "Causal Effects in Non-experimental Studies: Reevaluating the Evaluation of Training Programs." Journal of the American Statistical Association, December 1999, 94(448), pp. 1053-62.
Greenberg, David and Shroder, Mark. The Digest of Social Experiments, Second Edition. Washington, D.C. The Urban Institute Press, 1997.
Heckman, James. "Varieties of Selection Bias." American Economic Review Papers and Proceedings, May 1990, 80(2), pp. 313-18.
Heckman, James and Hotz, Joseph. "Choosing Among Alternative Non-experimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training." Journal of the American Statistical Association, December 1989, 84(408), pp. 862-74.
Heckman, James, Ichimura, Hidehiko; Smith, Jeffrey and Todd, Petra. "Characterizing Selection Bias Using Experimental Data." Econometrica, September 1998a, 66(5), pp. 1017-98.
Heckman James, Ichimura, Hidehiko and Todd, Petra. "Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme." Review of Economic Studies, October 1997, 64(4), pp. 605-54.
Heckman, James, Ichimura, Hidehiko and Todd, Petra. "Matching as an Econometric Evaluation Estimator." Review of Economic Studies, April 1998b, 65(2), pp. 261-94.
Heckman, James, Lalonde, Robert and Smith, Jeffrey. "The Economics and Econometrics of Active Labor Market Programs," in Orley Ashenfelter and David Card, eds., Handbook of labor economics, Vol. 3A. Amsterdam: North-Holland, 1999, pp. 1865-2097.
Heckman, James and Smith, Jeffrey. "The Pre-program Earnings Dip and the Determinants of Participation in a Social Program: Implications for Simple Program Evaluation Strategies." Economic Journal, July 1999, 109(457), pp. 313-48.
Ichimura, Hidehiko and Taber, Christopher. "Direct Estimation of Policy Impacts." National Bureau of Economic Research (Cambridge, MA) Technical Working Paper No. 254, 2000.
Imbens, Guido and Angrist, Joshua. "Identification and Estimation of Local Average Treatment Effects." Econometrica, March 1994, 62(2), pp. 467-75.
LaLonde, Robert. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." American Economic Review, September 1986, 76(4), pp. 604-20.
Newey, Whitney and Powell, James. "Instrumental Variables for Nonparametric Models." Unpublished manuscript, Princeton University, 1989.
Nicholson, Walter. "Assessing the Feasibility of Measuring Medium Term Net Impacts of the EBSM Program in Nova Scotia." Working Paper prepared for HRDC, March 2000.
Nicholson, Walter. "Design Issues in the Summative Evaluations: A Summary of Meetings..." HRDC, March 2001.
Rosenbaum, Paul and Rubin, Donald. "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika, April 1983, 70(1), pp. 41-55.
Smith, Jeffrey and Todd, Petra. "Reconciling Conflicting Evidence on Performance of Propensity Score Matching Methods." American Economic Review Papers and Proceedings, May 2001, 91(2), pp. 112-18.
This annex identifies a range of possible program effects for the evaluation. This "possible program effects approach" is consistent with the framework used by the Office of the Auditor General in the recent past to audit program effectiveness measurement by federal departments and agencies.
The intention for evaluation purposes is not to measure individual results versus some desired quantitative goal or detailed quantitative benchmark which is expected to be achieved. Rather, the approach is to examine and measure, as comprehensively as possible, what it is that the EBSM/LMDA-funded programs have been doing (their activities) and what the results have been.
The intention is to undertake the same basic evaluation approach across all of the regions and then develop an aggregate analysis at the national level with respect to overall EBSM/LMDA program results. Provision is also to be built in for local flexibility in terms of subject matter areas covered, by providing for any further, additional evaluation questions/indicators which individual regions may wish to include.
It is expected that, once the evaluation process proceeds, consultants selected will prepare methodology reports that will deal in more detail with the measurement process in the areas identified.
The possible effects listed below in the section entitled "Possible Program Effects" are based on issues that have been considered important in many employment-related policies and studies implemented by various researchers and levels of government, nationally and internationally. In summary, important sources for issues with respect to EBSM evaluations can be derived from:
As indicated in the body of this report, the methodology includes several measurement techniques that may be applied to a broad range of outcomes/indicators. Where appropriate, therefore, Human Resources Development Canada (HRDC) expects that outcomes will be measured as follows: post-program outcomes, differences between pre- and post-program experiences, and comparisons between participant and non-participant experiences.
POSSIBLE PROGRAM EFFECTS
PROGRAM OUTCOMES
INDIVIDUAL CLIENTS ASSISTED
KEY CLIENTELE CHARACTERISTICS (Focus on key types of problems being encountered by individual clients assisted under the LMDA arrangements).
LOCAL COMMUNITIES
GOVERNMENTS
FOLLOW-UP TO FORMATIVE STUDIES