C. Sample Design
Three major questions must be faced in designing the actual samples to be used in the
summative evaluations: (1) How is the participant group to be selected? (2) How are the
administrative and survey data to be used in combination to select the comparison
group(s)? and (3) How should resources be allocated to the survey samples23?
1. Participant Selection
The Panel did not believe that there were major conceptual issues involved in selection of
a participant sample for the evaluations (assuming that administrative data were of
sufficiently high quality). Participants would be selected from administrative data in a way
that best represents program activity during some period. The Panel did make three minor
recommendations about this selection process:
- Participants should be sampled over an entire year[24] so as to minimize potential
seasonal influences on the results;
- Participants should be stratified by Employment Benefit (EB) type and by location.
This would ensure adequate representation of interventions with relatively small
numbers of participants. It would also ensure representation of potential variations
in program content across a region; and
- The participant sample should be selected so that there would be a significant post-program period before surveying would be undertaken. Practically, this means that at least one year should have elapsed between the end of the sampling period and the start of the survey period.[25]
2. Comparison Group Selection
Selection of comparison groups is one of the most crucial elements of the evaluation
design effort. In order to simplify the discussion of this issue we assume that all analysis
will be conducted using regression adjusted mean estimates of the impact of Employment
Benefits and Support Measures (EBSM) on participants. That is, for the moment we
disregard some of the estimation issues raised in Section B in order to focus explicitly on
comparison group selection issues. Examining whether our conclusions here would be
changed if different estimation strategies (such as difference-in-difference or IV
procedures) were used is a topic requiring further research, as we discuss in Section F.
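To make the notion of "regression adjusted mean estimates" concrete, the following sketch estimates a program impact by regressing an outcome on a treatment dummy and a pre-program covariate. The data, variable names, and magnitudes are all synthetic assumptions for illustration, not figures from the evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic illustration: one covariate (pre-program earnings), a
# treatment indicator, and a "true" impact of 1200 built into the data.
pre_earnings = rng.normal(20000, 5000, n)
treated = rng.integers(0, 2, n)
outcome = 5000 + 0.6 * pre_earnings + 1200 * treated + rng.normal(0, 2000, n)

# Regression-adjusted mean impact: OLS of the outcome on a constant,
# the treatment dummy, and the pre-program covariate.  The coefficient
# on the dummy is the adjusted participant/comparison difference.
X = np.column_stack([np.ones(n), treated, pre_earnings])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
impact = beta[1]
print(round(impact, 1))
```

With a genuinely comparable comparison group the adjustment only sharpens precision; the estimation issues discussed in Section B arise precisely when the groups differ on unobservables that this regression cannot hold constant.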
Three data sources are potentially available for comparison group selection: (1) Employment Insurance (EI)-related administrative data; (2) Canada Customs and Revenue Agency (CCRA) administrative data on earnings and other tax-related information; and (3) survey data. It is important to understand the tradeoffs involved in using these various data sources and how those tradeoffs might influence other aspects of the analysis.
- EI-related data: These data are the most readily available for comparison group
selection. Comparison group members could be selected on the basis of EI history
and/or on the basis of data on past job separations (the Record of Employment (ROE)
data). Such matching would probably do a relatively poor job of actually matching
participants' employment histories. That would be especially true for so-called
"reachback" clients — those who are not currently on an active EI claim. Although
it would be feasible to draw a comparison sample of individuals filing claims in the
past, it seems likely that such individuals would have more recent employment
experiences than would clients in the reachback group. Hence, even if matching
solely on the EI data were considered suitable for the active claimant group (in itself
a doubtful proposition) it would be necessary to adopt additional procedures for
reachback clients.
- CCRA Earnings Data: Availability of CCRA earnings data plays a potentially
crucial role in the design of the summative evaluations. It is well known[26] that
earnings patterns in the immediate pre-program period are an important predictor of
program participation itself (Ashenfelter 1978, Heckman and Smith 1999). More
generally, it is believed that adequately controlling for earnings patterns is one
promising route to addressing evaluation problems raised by unobservable variables
(Ashenfelter 1979). This supposition is supported by some early estimates of the
impact of EBSM interventions in Nova Scotia (Nicholson 2000) which illustrate
how CCRA data can be used in screening a broadly-defined comparison group to
look more like a group of program participants.[27] Unfortunately, the extent to which
these data will be available to EBSM evaluators is currently unknown. But, given
the suggested sample selection and survey scheduling, it would have been feasible
under previous data access standards to obtain an extensive pre-program profile for
virtually all participants and potential comparison group members. Regardless of
whether one opted for a general screening to produce a comparison group or used
some form of pair-wise matching on an individual basis, it seems quite likely that a
rather close matching on observed earnings could be obtained.
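The screening idea described above can be sketched in a few lines. The earnings distributions and percentile cutoffs below are illustrative assumptions, not actual CCRA variables or values from the Nova Scotia work.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic pre-program annual earnings for participants and for a
# broadly-defined comparison pool (both distributions are invented).
participants = rng.normal(18000, 4000, 500)
broad_pool = rng.normal(25000, 8000, 5000)

# Screening: retain only pool members whose pre-program earnings fall
# inside the central 5th-95th percentile band of the participant
# distribution, so the screened pool "looks more like" the participants.
lo, hi = np.percentile(participants, [5, 95])
screened = broad_pool[(broad_pool >= lo) & (broad_pool <= hi)]

print(len(screened), round(screened.mean()))
```

The screened pool's mean earnings sit much closer to the participants' mean than the raw pool's did; pair-wise matching would push the alignment further, at the cost of discarding more of the pool.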
- Survey Data: A third potential source of data for the development of a comparison
sample is the follow-up survey that would be administered about two years after
entry into the EBSM program. The advantage of this data source is that it provides the opportunity to collect consistent and very detailed data from both participants
and potential comparison group members about pre-program labour force activities
and other information related to possible entry into EBSM interventions. These data
can be used in a variety of methodologies (including both characteristic and
propensity score matching and a variety of IV procedures) that seek to control for
participant/comparison differences.
The potential advantages of using the survey data to structure analytical
methodologies in the evaluations should not obscure the shortcomings of these data,
however. These include:
- The survey data on pre-program activities will not be a true "baseline" survey.
Rather, the data will be collected using questions that ask respondents to
remember events several years in the past. Errors in recall on such surveys can
be very large — and such errors will be directly incorporated into the
methodologies that rely heavily on such retrospective survey data.
- Using the survey data to define comparison groups will necessarily result in
some reduction in sample sizes ultimately available for analysis — simply
because some of the surveyed individuals may prove to be inappropriate as
comparison group members. The extent of this reduction will depend
importantly on how much matching can be done with the administrative data. In
the absence of the CCRA data such reductions could be very large. This would
imply that a large amount of the funds spent on the survey might ultimately prove
to have been expended for no analytical purpose.
- Finally, there is the possibility that reliance on the survey data to define
comparison groups may compromise the primary outcome data to be
collected by the survey. Most obviously this compromise would occur if the
space needed in the survey to accommodate extensive pre-program information
precluded the collection of more detailed post-program data. On a more subtle
level, collecting both extensive pre- and post- program data in the same survey
may encourage respondents to shade their responses in ways that impart
unknown biases into the reported data. Although there is no strong empirical
evidence on this possibility, survey staff will often check back-and-forth over a
specific set of responses asking whether they are consistent. This may improve
the quality of the data, but it may also impart biases to the reporting of
retrospective data and therefore impact some estimates.
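The propensity score matching mentioned above can be sketched as follows. This is a simplified illustration on synthetic data (the covariates and coefficients are invented for the example): the participation probability is fit by logistic regression, then each participant is paired with the nearest-score comparison case.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic survey covariates (hypothetical: constant, age, months
# employed pre-program); participation depends on them.
X = np.column_stack([np.ones(n), rng.normal(0, 1, n), rng.normal(0, 1, n)])
p_true = 1 / (1 + np.exp(-(X @ np.array([-0.5, 0.8, -0.4]))))
d = (rng.random(n) < p_true).astype(float)

# Fit the propensity score by logistic regression (Newton-Raphson).
b = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ b)))
    W = p * (1 - p)
    b += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (d - p))
pscore = 1 / (1 + np.exp(-(X @ b)))

# One-to-one nearest-neighbour matching (with replacement) on the score.
treated_idx = np.flatnonzero(d == 1)
control_idx = np.flatnonzero(d == 0)
matches = control_idx[np.abs(pscore[control_idx][None, :] -
                             pscore[treated_idx][:, None]).argmin(axis=1)]
print(len(matches), len(treated_idx))
```

Here the score is built from accurately measured covariates; the recall-error problem discussed above means that, with retrospective survey data, the same machinery would be matching on noisy versions of these variables.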
3. Suggested Approaches
This discussion suggests two approaches to the comparison group specification problem:
- The Ideal Approach: It seems clear that, given the data potentially available to the evaluations, the ideal approach to comparison group selection would be to use both EI and CCRA data to devise a comparison sample that closely matched the participant sample along dimensions observable in these two data sources. Pair-wise matching might be feasible with these large administrative data sets, but such an approach to matching may not be the best design from an overall perspective depending on the precise analytical strategies to be followed. For example, a strategy that focused more on sample screening to achieve matching may perform better in applications where there is an intention to utilize IV estimation procedures because screening may retain more of the underlying variance in variables that determine participation. Pair-wise matching also poses some logistical problems in implementation at the survey stage. Surveys could be conducted with a random sample of pairs, but non-response by one or the other member of a pair would exacerbate considerably the problem of non-response bias. Of course, no procedure of matching based on observable variables can promise that unobservable differences will not ultimately yield inconsistent results. Still, an approach that makes some use of EI and CCRA data to select the comparison sample would seem to be the most promising strategy in many circumstances. Some research on how first stage matching can best be utilized in an evaluation that will ultimately be based on survey data would clearly be desirable, however.
- The Alternative Approach: If CCRA data are not available for sample selection in an evaluation, it would be necessary to adopt a series of clearly second-best procedures. These would start with some degree of rough matching using available EI data. The data sources and variables to be used might vary across the evaluations depending on the nature of local labour markets and of the individuals participating in the EB interventions. Following this rough matching, all further comparison group procedures would be based on the survey data. This approach would have three major consequences for the overall design of the evaluations:
- The survey would have to be longer so that adequate information on pre-program labour force activities could be gathered;
- The comparison group would have to be enlarged relative to the "ideal"
plan (see the next sub-section) to allow for the possibility of surveying
individuals who ultimately prove to be non-comparable; and
- The relative importance of matching methods would have to be reduced in the evaluations (if only because of the reduced sample sizes inherent in using a matching methodology based on the survey data) and the role for IV procedures[28] expanded. Because of this necessity of relying on IV procedures, it may be necessary to devote additional resources to the development of "good" instruments either through additional data collection[29] or through some type of random assignment process.
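To make the IV idea concrete, the sketch below simulates the kind of setting footnote 29 envisions, with a hypothetical "distance to provider" instrument. It is an illustration of two-stage least squares on synthetic data, not a prescription for the evaluations.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000

# Synthetic setting: an unobserved factor raises both participation and
# earnings, so the naive participant/comparison contrast is biased.
# "Distance to provider" (hypothetical instrument) shifts participation
# but has no direct effect on earnings; the true impact is set to 1000.
unobs = rng.normal(0, 1, n)
dist = rng.normal(0, 1, n)
d = ((-0.8 * dist + unobs + rng.normal(0, 1, n)) > 0).astype(float)
y = 1000 * d + 800 * unobs + rng.normal(0, 500, n)

# Two-stage least squares: regress participation on the instrument,
# then regress earnings on the fitted participation probability.
Z = np.column_stack([np.ones(n), dist])
d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]
iv_impact = np.linalg.lstsq(np.column_stack([np.ones(n), d_hat]),
                            y, rcond=None)[0][1]

# Naive OLS on actual participation, biased upward by the unobservable.
ols_impact = np.linalg.lstsq(np.column_stack([np.ones(n), d]),
                             y, rcond=None)[0][1]
print(round(iv_impact), round(ols_impact))
```

The instrument-based estimate recovers the built-in impact while the naive contrast overstates it, which is exactly why the search for "good" instruments matters: the whole exercise rests on the instrument affecting participation without affecting outcomes directly.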
4. Sample Sizes
The complexities and uncertainties involved in the design of the summative evaluations make it difficult to offer definitive statements about desirable sample sizes. Still, the Panel believed that some conclusions about this issue could be drawn from other evaluations — especially those using random assignment. Because, in principle, randomly assigned samples pose no special analytical problems, they can be viewed as "base cases" against which non-experimental designs can be judged. Under ideal conditions where all the assumptions hold, a non-experimental design would be equivalent to a random assignment experiment once the appropriate analytical methodologies (such as pair-wise matching or IV techniques) have been applied. Hence, existing random assignment experiments provide an attractive model for the evaluations.
Table 3 records the sample sizes used for the analysis[30] of a few of the leading random assignment evaluations[31] in the United States:
Table 3: Analysis Sample Sizes in a Selection of Random Assignment Experiments

| Evaluation | Experimental Sample Size | Control Sample Size | Number of Treatments |
|---|---|---|---|
| National JTPA | 13,000 | 7,000 | 3 |
| Illinois UI Bonus | 4,186 | 3,963 | 1 |
| NJ UI Bonus | 7,200 | 2,400 | 3 |
| PA UI Bonus | 10,700 | 3,400 | 6 |
| WA UI Bonus | 12,000 | 3,000 | 6 |
| WA Job Search | 7,200 | 2,300 | 3 |
| SC Claimant | 4,500 | 1,500 | 3 |
| Supported Work | 3,200 | 3,400 | 1 |
| S-D Income Maint. | 2,400 | 1,700 | 7 |
| National H.I. | 2,600 | 1,900 | 3 |
Several patterns are apparent in this summary table:
- Sample sizes are all fairly large — control samples are at least 1,500 and more usually
in the 2,000+ range;
- Single treatment experiments tend to opt for equal allocations of experimental and control cases;[32]
- Evaluations with multiple treatments allocate relatively larger portions of their samples
to treatment categories. Usually the control groups are larger than any single treatment
cell, however; and
- Although it is not apparent in the table, many of the evaluations utilized a "tiered"
treatment design in which more complex treatments were created by adding
components to simple treatments (this was the case for many of the UI-related
evaluations, for example). In this case, the simple treatments can act as "controls" for
the more complex ones by allowing measurement of the incremental effects of the
added treatments.[33] Hence, the effective number of "controls" may be understated in the
table for these evaluations.
Because many of these evaluations were seeking to measure outcomes quite similar to those expected to be measured in the EBSM evaluations, these sample sizes would appear to have some relevance to the EBSM case. Specifically, these experiences would seem to suggest effective comparison sample sizes of at least 2,000 individuals.[34] The case for such a relatively large comparison sample is buttressed by consideration of the nature of the treatments to be examined in the EBSM evaluation. Because the five major interventions offered under regional EBSM programs are quite different from each other, it will not be possible to obtain the efficiencies that arise from the tiered designs characteristic of the UI experiments.[35] Heterogeneity in the characteristics of participants in the individual EBSM interventions poses an added reason for a large comparison group. In evaluating individual interventions it is likely that only a small portion of the overall comparison group can be used for each intervention. Again, the case for a large comparison sample is a strong one.
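The plausibility of a 2,000-case benchmark can be checked with a standard minimum-detectable-effect calculation for a difference in mean annual earnings (two-sided 5% test, 80% power). The earnings standard deviation used here is an illustrative assumption, not a figure from the Nova Scotia data or from Section B.

```python
import math

# Standard normal critical values: 5% two-sided test, 80% power.
z_alpha, z_beta = 1.96, 0.84
sigma = 12000.0  # assumed std. dev. of annual earnings (illustrative)

def mde(n_per_group: int) -> float:
    """Smallest detectable difference in means with n cases per group."""
    return (z_alpha + z_beta) * math.sqrt(2 * sigma**2 / n_per_group)

for n in (500, 1000, 2000, 4000):
    print(n, round(mde(n)))
```

Under this assumption, moving from 500 to 2,000 cases per group halves the minimum detectable earnings impact, while further doublings buy progressively less, which is broadly consistent with the clustering of control samples around 2,000+ in Table 3.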
Interest in evaluating the specific EB interventions raises additional complications for sample allocation. In most regions SD interventions are by far the most common. A simple random sample of interventions would therefore run the danger of providing very small sample sizes for the other interventions. Hence, it seems likely that most evaluations will stratify their participant samples by intervention and over-weight the smaller interventions.[36] If an evaluation adopts this structure, deciding on a comparison structure is a complex issue. Should the comparison sample also be stratified so that each sub-group mirrors individuals in the intervention strata? Or should the comparison sample seek to match the characteristics of a random sample of participants? The latter approach might be preferable if IV procedures were intended to play a major role in the analysis because such a broader comparison group might aid in predicting assignment to treatments. On the other hand, the use of intervention-specific comparison groups might be preferable for use with simpler estimation procedures. Evaluators should be expected to address such matters carefully in their designs.
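The compromise between proportionate representation and equal-size cells described in footnote 36 can be sketched as a simple weighted allocation. The intervention labels and population counts below are illustrative assumptions, not actual regional caseloads.

```python
# Hypothetical participant counts by intervention (SD dominant, as the
# text notes); labels and numbers are invented for the illustration.
population = {"SD": 9000, "TWS": 1200, "SE": 900, "JCP": 600, "EAS": 3300}
total_sample = 3000
weight = 0.5  # 0 = purely proportional sampling, 1 = equal-size cells

k = len(population)
pop_total = sum(population.values())
allocation = {}
for name, count in population.items():
    proportional = total_sample * count / pop_total
    equal = total_sample / k
    # Blend the two allocation rules; over-weights small interventions.
    allocation[name] = round((1 - weight) * proportional + weight * equal)

print(allocation)
```

With these numbers the dominant SD intervention gives up part of its proportional share (1,800 cases) to the smaller interventions, each of which gains enough cases to support intervention-specific estimates.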
5. Issues in Implementation
Implementation of any sort of matching strategy in an evaluation that relies on surveys poses special problems. The general goal is to avoid "wasting" interviews of comparison group members that will never be used in the subsequent analysis. This usually requires that some significant portion of interviews of the participant sample be completed before the comparison interviews are fielded. In that way, the characteristics of the comparison sample can be adjusted based on the experiences encountered with the participant sample.[37] An even more complex surveying strategy might be necessary if only minimal administrative data are available for the sample selection process. In this case, it may be necessary to include some form of screening questions in the surveys of comparison cases so that surveys of non-comparable cases can be cut short before large amounts of time have been expended. This procedure has been suggested, for example, for dealing with the problem of finding comparison cases for participants in the reachback group. Because it is possible to build in biases through such survey-based sample selection procedures, they should be adopted only with extreme caution.
Footnotes
[23] Administrative data are treated here as being plentiful and costless to collect. Sample sizes in the administrative data collection phase of the evaluations are therefore treated as unlimited. For specific, infrequent interventions this may not be the case, however, so we do briefly discuss sample allocations among interventions.
[24] Use of a Fiscal Year would also facilitate comparisons to other administrative data — especially if start dates were used to define participation.
[25] If surveys were conducted over an entire year this would permit two years to have elapsed since the program start date. If surveys were bunched so as to create interviewing efficiencies, the Panel recommended a longer period between the end of the sample period and the start of interviewing (perhaps 18 months or more).
[26] At least this is a common finding in evaluations of active labour market programs in the United States. Heckman, Lalonde, and Smith (1999) suggest that this "Ashenfelter dip" may be a worldwide phenomenon. The extent of the phenomenon among EBSM participants is unknown, although some preliminary work on data from Nova Scotia (Nicholson, 2000) suggests that the dip occurs in that program too.
[27] Other variables available in CCRA tax data that may be used to match participant and comparison group members include total income, total family income, number of dependents, and so forth.
[28] Difference-in-difference methods might also be used more extensively, though the use of such methods with data from a single survey opens the possibility of correlations in reporting errors over time biasing results.
[29] For example, collecting data on geographical distances to service providers may provide a useful instrument in some locations.
[30] These "final" sample sizes allow for survey and item nonresponse. Initial sample sizes would have to be increased to allow for such attrition.
[31] For a summary of many of these evaluations together with an extensive set of references, see Greenberg and Shroder (1997).
[32] Such an allocation would minimize the variance of an estimated treatment effect for a given evaluation budget assuming that treatment and control cases are equally costly.
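The claim in footnote 32 follows from a one-line minimization. With a common outcome variance and a fixed total sample, the variance of the estimated treatment effect is:

```latex
\mathrm{Var}(\hat{\Delta}) \;=\; \sigma^{2}\left(\frac{1}{n_t} + \frac{1}{n_c}\right)
\;=\; \sigma^{2}\left(\frac{1}{n_t} + \frac{1}{N - n_t}\right),
\qquad n_t + n_c = N.
```

Setting the derivative with respect to \(n_t\) to zero gives \(n_t^{-2} = (N - n_t)^{-2}\), i.e. \(n_t = n_c = N/2\): equal allocation minimizes the variance when treatment and control cases cost the same.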
[33] In many of the evaluations, however, the less elaborate treatments often prove to be the most effective. That is the case in practically all of the UI-related experiments.
[34] This conclusion is roughly consistent with the illustrative power calculations presented in Section B (which are based on variances observed in the Nova Scotia data). It should again be noted that the sample sizes suggested here are effective sample sizes. In other words, these would be the number of cases available for analysis, net of out-of-scope cases, non-responses to surveys, and so forth. In addition, where income data are required, individual projects may be advised to implement the surveys in the field at approximately tax time so that respondents will have all their tax data readily available and fresh in their minds during the interview.
[35] A possible tiered design would be to adopt an EAS-only cell in some of the evaluations, however. Experience from the UI experiments in the United States suggests that the EAS-only treatment might indeed have some detectable effects.
[36] More generally, the participant sample could be allocated to specific interventions on the basis of regional policy interest in each such intervention. This might dictate an allocation plan that represented a compromise between proportionate representation and stratification into equal size cells.
[37] The actual tailoring of such procedures to deal with survey nonresponse can be a very tricky issue, especially if participant and comparison groups have differential rates of nonresponse. This is another reason why issues related to nonresponse warrant a prominent place in evaluation designs.