One of the main criticisms of experimental evaluations, in which research units such as people, schools, classrooms, and neighborhoods are randomly assigned to a program or to a control group, is that they are conducted under such special circumstances that their results may not apply to other people, places, or times. In evaluation parlance, experiments are criticized for having limited “external validity.”
This “problem” with external validity has two main features:
- Because researchers have a hard time recruiting sites to participate in experiments, the sites that do participate differ from those that do not; and
- Individuals who participate in experimental evaluations do not look like the broader population(s) of interest. This is generally a consequence of having selected non-representative sites.
These issues can be addressed through either design or analysis: Researchers can design and implement experiments so that they take place in representative settings, or they can adjust an experiment’s results so that they reflect a broader population of interest.
Improving External Validity through Design
Experiments have been successfully conducted in nationally representative sets of sites. Perhaps the two most notable of these are the National Job Corps Study and the Head Start Impact Study. Of course, a random selection of sites is the most straightforward way to ensure representativeness and the generalizability of an evaluation’s results. But there are other ways to ensure that the sites selected to be part of an experimental evaluation represent the sampling frame of interest.
Recent research provides more flexible alternatives to taking a random sample of sites that allow for administrative realities while ensuring the generalizability of study findings. Beth Tipton’s proposed approach involves grouping like sites, ranking them, and deliberately choosing them to mirror the population of interest. Under this approach, sites that decline to participate in the study can be replaced with the next site in the ranking, preserving the representativeness and making the approach practical and desirable for evaluation.
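To make the replacement idea concrete, here is a minimal sketch in Python. It is an illustration of the general logic described above, not Tipton's actual procedure: the strata, ranks, quotas, and the `agrees` function are all hypothetical, and how sites get grouped and ranked in the first place is assumed to have been done already.

```python
from collections import defaultdict


def select_sites(sites, quotas, agrees):
    """Pick sites stratum by stratum, walking down each stratum's ranking.

    sites  : list of (site_id, stratum, rank) tuples; lower rank = approach first
    quotas : dict mapping stratum -> number of sites wanted from that stratum
    agrees : function site_id -> bool, True if the site agrees to participate
    (All three inputs are hypothetical stand-ins for a real recruitment process.)
    """
    by_stratum = defaultdict(list)
    for site_id, stratum, rank in sites:
        by_stratum[stratum].append((rank, site_id))

    selected = {}
    for stratum, quota in quotas.items():
        chosen = []
        # Approach sites in rank order; a refusal simply moves us to the next site.
        for _, site_id in sorted(by_stratum[stratum]):
            if len(chosen) == quota:
                break
            if agrees(site_id):
                chosen.append(site_id)
        selected[stratum] = chosen
    return selected


# Made-up example: two groups of like sites, and site "B" declines.
sites = [("A", "urban", 1), ("B", "urban", 2), ("C", "urban", 3),
         ("D", "rural", 1), ("E", "rural", 2)]
quotas = {"urban": 2, "rural": 1}
result = select_sites(sites, quotas, agrees=lambda s: s != "B")
print(result)
```

In this made-up case, site "B" declines and the next urban site in the ranking takes its place, so each group still contributes its intended number of sites.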
Improving External Validity through Analysis
Established recommendations exist for adjusting the results from non-representative experimental evaluations so that they can be generalized to a population of interest. Generally, these involve:
- Analyzing the traits of an experimental sample and their association with the sample’s impacts;
- Comparing features of the sample to those of the broader population to determine whether and how they differ; and
- Adjusting the study’s impact estimates to account for differences between the sample and the population. This has been addressed by Bell, Olsen, Orr, Stuart and colleagues, and Tipton.
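One simple way to picture these three steps is post-stratification: estimate impacts separately by subgroup, compare the sample's subgroup mix to the population's, and then reweight. The Python sketch below assumes that approach purely for illustration; it is not necessarily the method used by the authors named above, and every number and name in it is made up.

```python
def weighted_impact(subgroup_impacts, shares):
    """Average the subgroup impacts using the given subgroup shares as weights."""
    return sum(subgroup_impacts[g] * shares[g] for g in subgroup_impacts)


# Step 1: impacts estimated separately by a site trait (hypothetical values).
impacts = {"urban": 4.0, "rural": 1.0}

# Step 2: compare the sample's mix of sites to the population's (hypothetical).
sample_shares = {"urban": 0.75, "rural": 0.25}
population_shares = {"urban": 0.5, "rural": 0.5}
share_gaps = {g: sample_shares[g] - population_shares[g] for g in impacts}

# Step 3: reweight the subgroup impacts to match the population's mix.
sample_average = weighted_impact(impacts, sample_shares)
population_average = weighted_impact(impacts, population_shares)

print(share_gaps)
print(sample_average, population_average)
```

Here urban sites are overrepresented in the sample and also show larger impacts, so the unadjusted sample average overstates what the broader population would be expected to experience. The adjustment only helps to the extent that the traits used for reweighting are the ones that actually drive differences in impacts.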
What’s Next for the External Validity of Experiments?
I am excited about the next wave of experimental evaluations. Previous blog posts have discussed other ways in which experiments are increasingly providing better information about program effectiveness to practitioners and policymakers.
I am confident that the improvements to experimental designs in practice will help future evaluations provide more generalizable results. And when implementing these design advancements is not possible, analytic advancements can still help researchers, program administrators, and policymakers extend the results from experiments to other people, places, and times.
Read more about these evaluation issues:
- New Directions for Evaluation, issue 152, Chapter 3 (Bell and Stuart, 2016) and Chapter 4 (Olsen and Orr, 2016)
- Evaluation Review, volume 41, issue 1, Part 1 of 2 of the Special Issue on External Validity and Policy (Part 2 of 2 is forthcoming.)