C. Sample Design
Three major questions must be faced in designing the actual samples to be used in the
summative evaluations: (1) How is the participant group to be selected? (2) How are the
administrative and survey data to be used in combination to select the comparison
group(s)? and (3) How should resources be allocated to the survey samples23?
1. Participant Selection
The Panel did not believe that there were major conceptual issues involved in selection of
a participant sample for the evaluations (assuming that administrative data were of
sufficiently high quality). Participants would be selected from administrative data in a way
that best represents program activity during some period. The Panel did make three minor
recommendations about this selection process:
- Participants should be sampled over an entire year[24] so as to minimize potential
seasonal influences on the results;
- Participants should be stratified by Employment Benefit (EB) type and by location.
This would ensure adequate representation of interventions with relatively small
numbers of participants. It would also ensure representation of potential variations
in program content across a region; and
- The participant sample should be selected so that there would be a significant post-program period before surveying would be undertaken. Practically, this means that at least one year should have elapsed between the end of the sampling period and the start of the survey period.[25]
2. Comparison Group Selection
Selection of comparison groups is one of the most crucial elements of the evaluation
design effort. In order to simplify the discussion of this issue we assume that all analysis
will be conducted using regression adjusted mean estimates of the impact of Employment
Benefits and Support Measures (EBSM) on participants. That is, for the moment we
disregard some of the estimation issues raised in Section B in order to focus explicitly on
comparison group selection issues. Examining whether our conclusions here would be
changed if different estimation strategies (such as difference-in-difference or IV
procedures) were used is a topic requiring further research, as we discuss in Section F.
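To make the notion of "regression adjusted mean estimates" concrete, the following sketch estimates a program impact by regressing an outcome on a treatment dummy and a pre-program covariate. The data, variable names, and magnitudes are all synthetic assumptions for illustration, not figures from the evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic illustration: one covariate (pre-program earnings), a
# treatment indicator, and a "true" impact of 1200 built into the data.
pre_earnings = rng.normal(20000, 5000, n)
treated = rng.integers(0, 2, n)
outcome = 5000 + 0.6 * pre_earnings + 1200 * treated + rng.normal(0, 2000, n)

# Regression-adjusted mean impact: OLS of the outcome on a constant,
# the treatment dummy, and the pre-program covariate.  The coefficient
# on the dummy is the adjusted participant/comparison difference.
X = np.column_stack([np.ones(n), treated, pre_earnings])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
impact = beta[1]
print(round(impact, 1))
```

With a genuinely comparable comparison group the adjustment only sharpens precision; the estimation issues discussed in Section B arise precisely when the groups differ on unobservables that this regression cannot hold constant.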
Three data sources are potentially available for comparison group selection: (1) Employment Insurance (EI)-related administrative data; (2) Canada Customs and Revenue Agency (CCRA) administrative data on earnings and other tax-related information; and (3) survey data. It is important to understand the tradeoffs involved in using these various data sources and how those tradeoffs might influence other aspects of the analysis.
- EI-related data: These data are the most readily available for comparison group
selection. Comparison group members could be selected on the basis of EI history
and/or on the basis of data on past job separations (the Record of Employment (ROE)
data). Such matching would probably do a relatively poor job of actually matching
participants' employment histories. That would be especially true for so-called
"reachback" clients — those who are not currently on an active EI claim. Although
it would be feasible to draw a comparison sample of individuals filing claims in the
past, it seems likely that such individuals would have more recent employment
experiences than would clients in the reachback group. Hence, even if matching
solely on the EI data were considered suitable for the active claimant group (in itself
a doubtful proposition) it would be necessary to adopt additional procedures for
reachback clients.
- CCRA Earnings Data: Availability of CCRA earnings data plays a potentially
crucial role in the design of the summative evaluations. It is well known[26] that
earnings patterns in the immediate pre-program period are an important predictor of
program participation itself (Ashenfelter 1978, Heckman and Smith 1999). More
generally, it is believed that adequately controlling for earnings patterns is one
promising route to addressing evaluation problems raised by unobservable variables
(Ashenfelter 1979). This supposition is supported by some early estimates of the
impact of EBSM interventions in Nova Scotia (Nicholson 2000) which illustrate
how CCRA data can be used in screening a broadly-defined comparison group to
look more like a group of program participants.[27] Unfortunately, the extent to which
these data will be available to EBSM evaluators is currently unknown. But, given
the suggested sample selection and survey scheduling, it would have been feasible
under previous data access standards to obtain an extensive pre-program profile for
virtually all participants and potential comparison group members. Regardless of
whether one opted for a general screening to produce a comparison group or used
some form of pair-wise matching on an individual basis, it seems quite likely that a
rather close matching on observed earnings could be obtained.
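The screening idea described above can be sketched in a few lines. The earnings distributions and percentile cutoffs below are illustrative assumptions, not actual CCRA variables or values from the Nova Scotia work.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic pre-program annual earnings for participants and for a
# broadly-defined comparison pool (both distributions are invented).
participants = rng.normal(18000, 4000, 500)
broad_pool = rng.normal(25000, 8000, 5000)

# Screening: retain only pool members whose pre-program earnings fall
# inside the central 5th-95th percentile band of the participant
# distribution, so the screened pool "looks more like" the participants.
lo, hi = np.percentile(participants, [5, 95])
screened = broad_pool[(broad_pool >= lo) & (broad_pool <= hi)]

print(len(screened), round(screened.mean()))
```

The screened pool's mean earnings sit much closer to the participants' mean than the raw pool's did; pair-wise matching would push the alignment further, at the cost of discarding more of the pool.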
- Survey Data: A third potential source of data for the development of a comparison
sample is the follow-up survey that would be administered about two years after
entry into the EBSM program. The advantage of this data source is that it provides the opportunity to collect consistent and very detailed data from both participants
and potential comparison group members about pre-program labour force activities
and other information related to possible entry into EBSM interventions. These data
can be used in a variety of methodologies (including both characteristic and
propensity score matching and a variety of IV procedures) that seek to control for
participant/comparison differences.
The potential advantages of using the survey data to structure analytical
methodologies in the evaluations should not obscure the shortcomings of these data,
however. These include:
- The survey data on pre-program activities will not be a true "baseline" survey.
Rather, the data will be collected using questions that ask respondents to
remember events several years in the past. Errors in recall on such surveys can
be very large — and such errors will be directly incorporated into the
methodologies that rely heavily on such retrospective survey data.
- Using the survey data to define comparison groups will necessarily result in
some reduction in sample sizes ultimately available for analysis — simply
because some of the surveyed individuals may prove to be inappropriate as
comparison group members. The extent of this reduction will depend
importantly on how much matching can be done with the administrative data. In
the absence of the CCRA data such reductions could be very large. This would
imply that a large amount of the funds spent on the survey might ultimately prove
to have been expended for no analytical purpose.
- Finally, there is the possibility that reliance on the survey data to define
comparison groups may compromise the primary outcome data to be
collected by the survey. Most obviously this compromise would occur if the
space needed in the survey to accommodate extensive pre-program information
precluded the collection of more detailed post-program data. On a more subtle
level, collecting both extensive pre- and post- program data in the same survey
may encourage respondents to shade their responses in ways that impart
unknown biases into the reported data. Although there is no strong empirical
evidence on this possibility, survey staff will often check back-and-forth over a
specific set of responses asking whether they are consistent. This may improve
the quality of the data, but it may also impart biases to the reporting of
retrospective data and therefore impact some estimates.
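The propensity score matching mentioned above can be sketched as follows. This is a simplified illustration on synthetic data (the covariates and coefficients are invented for the example): the participation probability is fit by logistic regression, then each participant is paired with the nearest-score comparison case.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Synthetic survey covariates (hypothetical: constant, age, months
# employed pre-program); participation depends on them.
X = np.column_stack([np.ones(n), rng.normal(0, 1, n), rng.normal(0, 1, n)])
p_true = 1 / (1 + np.exp(-(X @ np.array([-0.5, 0.8, -0.4]))))
d = (rng.random(n) < p_true).astype(float)

# Fit the propensity score by logistic regression (Newton-Raphson).
b = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ b)))
    W = p * (1 - p)
    b += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (d - p))
pscore = 1 / (1 + np.exp(-(X @ b)))

# One-to-one nearest-neighbour matching (with replacement) on the score.
treated_idx = np.flatnonzero(d == 1)
control_idx = np.flatnonzero(d == 0)
matches = control_idx[np.abs(pscore[control_idx][None, :] -
                             pscore[treated_idx][:, None]).argmin(axis=1)]
print(len(matches), len(treated_idx))
```

Here the score is built from accurately measured covariates; the recall-error problem discussed above means that, with retrospective survey data, the same machinery would be matching on noisy versions of these variables.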
3. Suggested Approaches
This discussion suggests two approaches to the comparison group specification problem:
- The Ideal Approach: It seems clear that, given the data potentially available to the evaluations, the ideal approach to comparison group selection would be to use both EI and CCRA data to devise a comparison sample that closely matched the participant sample along dimensions observable in these two data sources. Pair-wise matching might be feasible with these large administrative data sets, but such an approach to matching may not be the best design from an overall perspective depending on the precise analytical strategies to be followed. For example, a strategy that focused more on sample screening to achieve matching may perform better in applications where there is an intention to utilize IV estimation procedures because screening may retain more of the underlying variance in variables that determine participation. Pair-wise matching also poses some logistical problems in implementation at the survey stage. Surveys could be conducted with a random sample of pairs, but non-response by one or the other member of a pair would exacerbate considerably the problem of non-response bias. Of course, no procedure of matching based on observable variables can promise that unobservable differences will not ultimately yield inconsistent results. Still, an approach that makes some use of EI and CCRA data to select the comparison sample would seem to be the most promising strategy in many circumstances. Some research on how first stage matching can best be utilized in an evaluation that will ultimately be based on survey data would clearly be desirable, however.
- The Alternative Approach: If CCRA data are not available for sample selection in an evaluation, it would be necessary to adopt a series of clearly second-best procedures. These would start with some degree of rough matching using available EI data. The data sources and variables to be used might vary across the evaluations depending on the nature of local labour markets and of the individuals participating in the EB interventions. Following this rough matching, all further comparison group procedures would be based on the survey data. This approach would have three major consequences for the overall design of the evaluations:
- The survey would have to be longer so that adequate information on pre-program labour force activities could be gathered;
- The comparison group would have to be enlarged relative to the "ideal"
plan (see the next sub-section) to allow for the possibility of surveying
individuals who ultimately prove to be non-comparable; and
- The relative importance of matching methods would have to be reduced in the evaluations (if only because of the reduced sample sizes inherent in using a matching methodology based on the survey data) and the role for IV procedures[28] expanded. Because of this necessity of relying on IV procedures, it may be necessary to devote additional resources to the development of "good" instruments either through additional data collection[29] or through some type of random assignment process.
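To make the IV idea concrete, the sketch below simulates the kind of setting footnote 29 envisions, with a hypothetical "distance to provider" instrument. It is an illustration of two-stage least squares on synthetic data, not a prescription for the evaluations.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000

# Synthetic setting: an unobserved factor raises both participation and
# earnings, so the naive participant/comparison contrast is biased.
# "Distance to provider" (hypothetical instrument) shifts participation
# but has no direct effect on earnings; the true impact is set to 1000.
unobs = rng.normal(0, 1, n)
dist = rng.normal(0, 1, n)
d = ((-0.8 * dist + unobs + rng.normal(0, 1, n)) > 0).astype(float)
y = 1000 * d + 800 * unobs + rng.normal(0, 500, n)

# Two-stage least squares: regress participation on the instrument,
# then regress earnings on the fitted participation probability.
Z = np.column_stack([np.ones(n), dist])
d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]
iv_impact = np.linalg.lstsq(np.column_stack([np.ones(n), d_hat]),
                            y, rcond=None)[0][1]

# Naive OLS on actual participation, biased upward by the unobservable.
ols_impact = np.linalg.lstsq(np.column_stack([np.ones(n), d]),
                             y, rcond=None)[0][1]
print(round(iv_impact), round(ols_impact))
```

The instrument-based estimate recovers the built-in impact while the naive contrast overstates it, which is exactly why the search for "good" instruments matters: the whole exercise rests on the instrument affecting participation without affecting outcomes directly.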
4. Sample Sizes
The complexities and uncertainties involved in the design of the summative evaluations make it difficult to offer definitive statements about desirable sample sizes. Still, the Panel believed that some conclusions about this issue could be drawn from other evaluations — especially those using random assignment. Because, in principle, randomly assigned samples pose no special analytical problems, they can be viewed as "base cases" against which non-experimental designs can be judged. Under ideal conditions where all the assumptions hold, a non-experimental design would be equivalent to a random assignment experiment once the appropriate analytical methodologies (such as pair-wise matching or IV techniques) have been applied. Hence, existing random assignment experiments provide an attractive model for the evaluations.
Table 3 records the sample sizes used for the analysis[30] of a few of the leading random assignment evaluations[31] in the United States:
Table 3: Analysis Sample Sizes in a Selection of Random Assignment Experiments

| Evaluation | Experimental Sample Size | Control Sample Size | Number of Treatments |
|---|---|---|---|
| National JTPA | 13,000 | 7,000 | 3 |
| Illinois UI Bonus | 4,186 | 3,963 | 1 |
| NJ UI Bonus | 7,200 | 2,400 | 3 |
| PA UI Bonus | 10,700 | 3,400 | 6 |
| WA UI Bonus | 12,000 | 3,000 | 6 |
| WA Job Search | 7,200 | 2,300 | 3 |
| SC Claimant | 4,500 | 1,500 | 3 |
| Supported Work | 3,200 | 3,400 | 1 |
| S-D Income Maint. | 2,400 | 1,700 | 7 |
| National H.I. | 2,600 | 1,900 | 3 |
Several patterns are apparent in this summary table:
- Sample sizes are all fairly large — control samples are at least 1,500 and more usually
in the 2,000+ range;
- Single treatment experiments tend to opt for equal allocations of experimental and control cases;[32]
- Evaluations with multiple treatments allocate relatively larger portions of their samples
to treatment categories. Usually the control groups are larger than any single treatment
cell, however; and
- Although it is not apparent in the table, many of the evaluations utilized a "tiered"
treatment design in which more complex treatments were created by adding
components to simple treatments (this was the case for many of the UI-related
evaluations, for example). In this case, the simple treatments can act as "controls" for
the more complex ones by allowing measurement of the incremental effects of the
added treatments.[33] Hence, the effective number of "controls" may be understated in the
table for these evaluations.
Because many of these evaluations were seeking to measure outcomes quite similar to those expected to be measured in the EBSM evaluations, these sample sizes would appear to have some relevance to the EBSM case. Specifically, these experiences would seem to suggest effective comparison sample sizes of at least 2,000 individuals.[34] The case for such a relatively large comparison sample is buttressed by consideration of the nature of the treatments to be examined in the EBSM evaluation. Because the five major interventions offered under regional EBSM programs are quite different from each other, it will not be possible to obtain the efficiencies that arise from the tiered designs characteristic of the UI experiments.[35] Heterogeneity in the characteristics of participants in the individual EBSM interventions poses an added reason for a large comparison group. In evaluating individual interventions it is likely that only a small portion of the overall comparison group can be used for each intervention. Again, the case for a large comparison sample is a strong one.
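The plausibility of a 2,000-case benchmark can be checked with a standard minimum-detectable-effect calculation for a difference in mean annual earnings (two-sided 5% test, 80% power). The earnings standard deviation used here is an illustrative assumption, not a figure from the Nova Scotia data or from Section B.

```python
import math

# Standard normal critical values: 5% two-sided test, 80% power.
z_alpha, z_beta = 1.96, 0.84
sigma = 12000.0  # assumed std. dev. of annual earnings (illustrative)

def mde(n_per_group: int) -> float:
    """Smallest detectable difference in means with n cases per group."""
    return (z_alpha + z_beta) * math.sqrt(2 * sigma**2 / n_per_group)

for n in (500, 1000, 2000, 4000):
    print(n, round(mde(n)))
```

Under this assumption, moving from 500 to 2,000 cases per group halves the minimum detectable earnings impact, while further doublings buy progressively less, which is broadly consistent with the clustering of control samples around 2,000+ in Table 3.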
Interest in evaluating the specific EB interventions raises additional complications for sample allocation. In most regions SD interventions are by far the most common. A simple random sample of interventions would therefore run the danger of providing very small sample sizes for the other interventions. Hence, it seems likely that most evaluations will stratify their participant samples by intervention and over-weight the smaller interventions.[36] If an evaluation adopts this structure, deciding on a comparison structure is a complex issue. Should the comparison sample also be stratified so that each sub-group mirrors individuals in the intervention strata? Or should the comparison sample seek to match the characteristics of a random sample of participants? The latter approach might be preferable if IV procedures were intended to play a major role in the analysis because such a broader comparison group might aid in predicting assignment to treatments. On the other hand, the use of intervention-specific comparison groups might be preferable for use with simpler estimation procedures. Evaluators should be expected to address such matters carefully in their designs.
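The compromise between proportionate representation and equal-size cells described in footnote 36 can be sketched as a simple weighted allocation. The intervention labels and population counts below are illustrative assumptions, not actual regional caseloads.

```python
# Hypothetical participant counts by intervention (SD dominant, as the
# text notes); labels and numbers are invented for the illustration.
population = {"SD": 9000, "TWS": 1200, "SE": 900, "JCP": 600, "EAS": 3300}
total_sample = 3000
weight = 0.5  # 0 = purely proportional sampling, 1 = equal-size cells

k = len(population)
pop_total = sum(population.values())
allocation = {}
for name, count in population.items():
    proportional = total_sample * count / pop_total
    equal = total_sample / k
    # Blend the two allocation rules; over-weights small interventions.
    allocation[name] = round((1 - weight) * proportional + weight * equal)

print(allocation)
```

With these numbers the dominant SD intervention gives up part of its proportional share (1,800 cases) to the smaller interventions, each of which gains enough cases to support intervention-specific estimates.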
5. Issues in Implementation
Implementation of any sort of matching strategy in an evaluation that relies on surveys poses special problems. The general goal is to avoid "wasting" interviews of comparison group members that will never be used in the subsequent analysis. This usually requires that some significant portion of interviews of the participant sample be completed before the comparison interviews are fielded. In that way, the characteristics of the comparison sample can be adjusted based on the experiences encountered with the participant sample.[37] An even more complex surveying strategy might be necessary if only minimal administrative data are available for the sample selection process. In this case, it may be necessary to include some form of screening questions in the surveys of comparison cases so that surveys of non-comparable cases can be cut short before large amounts of time have been expended. This procedure has been suggested, for example, for dealing with the problem of finding comparison cases for participants in the reachback group. Because it is possible to build in biases through such survey-based sample selection procedures, they should be adopted only with extreme caution.
Footnotes
[23] Administrative data are treated here as being plentiful and costless to collect. Sample sizes in the administrative data collection phase of the evaluations are therefore treated as unlimited. For specific, infrequent interventions this may not be the case, however, so we do briefly discuss sample allocations among interventions.
[24] Use of a Fiscal Year would also facilitate comparisons to other administrative data — especially if start dates were used to define participation.
[25] If surveys were conducted over an entire year this would permit two years to have elapsed since the program start date. If surveys were bunched so as to create interviewing efficiencies, the Panel recommended a longer period between the end of the sample period and the start of interviewing (perhaps 18 months or more).
[26] At least this is a common finding in evaluations of active labour market programs in the United States. Heckman, Lalonde, and Smith (1999) suggest that this "Ashenfelter dip" may be a worldwide phenomenon. The extent of the phenomenon among EBSM participants is unknown, although some preliminary work on data from Nova Scotia (Nicholson, 2000) suggests that the dip occurs in that program too.
[27] Other variables available in CCRA tax data that may be used to match participant and comparison group members include total income, total family income, number of dependents, and so forth.
[28] Difference-in-difference methods might also be used more extensively, though the use of such methods with data from a single survey opens the possibility of correlations in reporting errors over time biasing results.
[29] For example, collecting data on geographical distances to service providers may provide a useful instrument in some locations.
[30] These "final" sample sizes allow for survey and item nonresponse. Initial sample sizes would have to be increased to allow for such attrition.
[31] For a summary of many of these evaluations together with an extensive set of references, see Greenberg and Shroder (1997).
[32] Such an allocation would minimize the variance of an estimated treatment effect for a given evaluation budget assuming that treatment and control cases are equally costly.
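The claim in footnote 32 follows from a one-line minimization. With a common outcome variance and a fixed total sample, the variance of the estimated treatment effect is:

```latex
\mathrm{Var}(\hat{\Delta}) \;=\; \sigma^{2}\left(\frac{1}{n_t} + \frac{1}{n_c}\right)
\;=\; \sigma^{2}\left(\frac{1}{n_t} + \frac{1}{N - n_t}\right),
\qquad n_t + n_c = N.
```

Setting the derivative with respect to \(n_t\) to zero gives \(n_t^{-2} = (N - n_t)^{-2}\), i.e. \(n_t = n_c = N/2\): equal allocation minimizes the variance when treatment and control cases cost the same.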
[33] In many of the evaluations, however, the less elaborate treatments often prove to be the most effective. That is the case in practically all of the UI-related experiments.
[34] This conclusion is roughly consistent with the illustrative power calculations presented in Section B (which are based on variances observed in the Nova Scotia data). It should again be noted that the sample sizes suggested here are effective sample sizes. In other words, these would be the number of cases available for analysis, net of out-of-scope cases, non-responses to surveys, and so forth. In addition, where income data are required, individual projects may be advised to implement the surveys in the field at approximately tax time so that respondents will have all their tax data readily available and fresh in their minds during the interview.
[35] A possible tiered design would be to adopt an EAS-only cell in some of the evaluations, however. Experience from the UI experiments in the United States suggests that the EAS-only treatment might indeed have some detectable effects.
[36] More generally, the participant sample could be allocated to specific interventions on the basis of regional policy interest in each such intervention. This might dictate an allocation plan that represented a compromise between proportionate representation and stratification into equal size cells.
[37] The actual tailoring of such procedures to deal with survey nonresponse can be a very tricky issue, especially if participant and comparison groups have differential rates of nonresponse. This is another reason why issues related to nonresponse warrant a prominent place in evaluation designs.