Common Problems with Formal Evaluations: Selection Bias and Publication Bias

A note on this page's publication date

The content on this page has not been recently updated. This content is likely to be no longer fully accurate, both with respect to the research it presents and with respect to what it implies about our views and positions.

Published: 2010

This page discusses the nature and extent of two common problems we see with formal evaluations: selection bias and publication bias.

Selection bias arises when participants in a program are systematically different from non-participants (even before they enter the program). Many evaluations compare program participants to non-participants in order to infer the effect of the program; selection bias can affect the legitimacy of these evaluations, and in particular, we believe that its presence is likely to skew evaluations of non-profits in the positive direction.
Publication bias refers to the tendency of researchers to slant their choice of presentation and publication in a positive direction.

We believe that these problems tend to skew evaluations of non-profits in the positive direction and that these problems are partially mitigated by the use of the randomized controlled trial (RCT) methodology (and can be mitigated by other techniques as well).

Below, we discuss both selection bias and publication bias: what they are, what sort of skew they are likely to bring about, why randomized controlled trials may suffer less from these issues, and what evidence is available on the extent and nature of these issues.

We then note two highly touted and studied interventions - microlending and Head Start - for which initial studies (highly prone to selection bias and publication bias) gave a much more positive picture than later results from randomized controlled trials. We see these cases as suggestive evidence for the view that lower-quality studies tend to give an exaggerated case for optimism about effectiveness.

A note on this page's publication date
Selection bias
Publication bias
Suggestive evidence on the combined effects of selection bias and publication bias: the cases of microlending and Head Start
Bottom line
Sources

Selection bias

What is selection bias?

Studies of social programs commonly compare people who participated in the program to people who did not, with the implication being that any differences are caused by the program. (Some studies report only on improvements or good performance among participants, but even in these cases there is often an implicit comparison to non-participants - for example, an implicit presumption that non-participants would not have shown improvement on the reported-on measures.)

However, program participants are different from non-participants, by the very fact of their participation. An optional after-school tutoring program may disproportionately attract students/families who place a high priority on education (so its participants will have better reading scores, graduation rates, etc. than non-participants even if the program itself has no effect); a microlending program may disproportionately attract people who have higher incomes to begin with; etc.
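The tutoring example can be illustrated with a small simulation. This is a minimal sketch under invented assumptions, not an analysis of any real program: the `priority` trait, the score formula, and every numeric parameter are hypothetical. It sets the program's true effect to zero and shows that a naive participant vs. non-participant comparison can still report a sizable gap.

```python
import random
import statistics


def simulate_self_selection(n=20_000, true_effect=0.0, seed=0):
    """Return (participant mean - non-participant mean) in test scores.

    Hypothetical model: each family has an unobserved 'priority' on
    education that raises both the chance of enrolling and the score.
    """
    rng = random.Random(seed)
    participants, non_participants = [], []
    for _ in range(n):
        priority = rng.gauss(0, 1)              # unobserved trait
        # Families with a higher priority are more likely to enroll.
        enrolls = priority + rng.gauss(0, 1) > 0
        score = 50 + 5 * priority + rng.gauss(0, 5)
        if enrolls:
            score += true_effect                # program effect (zero by default)
            participants.append(score)
        else:
            non_participants.append(score)
    return statistics.mean(participants) - statistics.mean(non_participants)


naive_gap = simulate_self_selection()
print(f"Apparent 'effect' with a true effect of zero: {naive_gap:.2f} points")
```

Because the same unobserved trait drives both enrollment and scores in this sketch, the gap reflects who enrolls rather than what the program does.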

What sort of skew is selection bias likely to cause?

Selection bias may skew a study in a positive or negative direction. Say that an after-school program targets struggling schools; in this case, comparing its participants to "average" students all across the city may be overly unfavorable to the tutoring program (since its students likely do not score as well as students in better-off schools), but comparing its participants to other students at the same schools may be overly favorable (since, as discussed above, its participants may tend to place higher priority on education).

One reason we are concerned about selection bias is that it gives researchers substantial room for judgment calls in their choice of comparison group. When it comes to studies of non-profits' impacts, we believe that researchers generally prefer to present the programs in a positive light, and thus tend to choose comparisons that favor the programs (more on this below under "Publication bias"). Thus, we feel that selection bias is generally likely to skew apparent results in favor of non-profits' programs.

Selection bias in low- vs. high-quality studies

Certain study designs are much less vulnerable to selection bias than others. A randomized evaluation,1 also known as a randomized controlled trial, generally avoids the problem of selection bias by using random assignment to assign some people and not others to a program; then people who were "lotteried in" (randomly assigned) to the program are tracked and compared to people who were "lotteried out." Intuitively speaking, this methodology seems to significantly reduce the risks that there will be any systematic differences between program participants and non-participants, other than whether they participated in the program.
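The intuition behind random assignment can be sketched in a few lines. The model below is hypothetical (the `priority` trait, the score formula, and the 50/50 lottery are assumptions made for illustration): an unobserved trait still affects outcomes, but a coin flip rather than the trait decides who enters the program, so the two groups should differ only by chance and by the program itself.

```python
import random
import statistics


def simulate_lottery(n=20_000, true_effect=0.0, seed=1):
    """Compare lotteried-in vs. lotteried-out applicants.

    Hypothetical model: an unobserved 'priority' trait raises scores, but a
    coin flip (not the trait) decides who gets into the program.
    """
    rng = random.Random(seed)
    lotteried_in, lotteried_out = [], []
    for _ in range(n):
        priority = rng.gauss(0, 1)              # unobserved trait
        score = 50 + 5 * priority + rng.gauss(0, 5)
        if rng.random() < 0.5:                  # random assignment
            lotteried_in.append(score + true_effect)
        else:
            lotteried_out.append(score)
    return statistics.mean(lotteried_in) - statistics.mean(lotteried_out)


print(f"True effect 0: estimated {simulate_lottery(true_effect=0.0):.2f}")
print(f"True effect 3: estimated {simulate_lottery(true_effect=3.0):.2f}")
```

In this sketch the estimate lands close to whatever true effect is supplied, up to sampling noise, even though the trait that matters is never observed.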

There are other ways of addressing the problem of selection bias, fully or partially. Speaking broadly, we feel that randomization is the single most reliable indicator that a study's findings can be interpreted without fear of selection bias, and most of the studies we refer to as "high-quality" involve randomization. However, there are studies we consider "high-quality" that do not involve randomization, such as the impact evaluation of VillageReach's pilot project.

Example of selection bias

Peikes, Moreno, and Orzol (2008) evaluated the impact of the US State Partnership Initiative employment promotion program, using two methods: (a) a randomized controlled trial, with very low vulnerability to selection bias (see discussion above regarding randomization); (b) propensity-score matching, a relatively popular method for attempting to simulate a comparison between program participants and identical non-participants without the benefit of randomization (using available observable characteristics of participants and non-participants).2 Despite "seemingly ideal circumstances" for method (b),3 the two methods produced meaningfully different results: in two of the three locations, method (b) implied large, positive, statistically significant impacts of the program on earnings, while method (a) implied negative, non-statistically significant impacts of the program on earnings.4 The authors concluded:5

Despite these seemingly ideal conditions, and the passing of tests that, according to the literature, indicate PSM [propensity-score matching] had worked, PSM produced impact estimates that differed considerably from the gold standard experimental estimates in terms of statistical significance, magnitude, and most important, sign. Specifically, the PSM approach would have led policymakers to conclude incorrectly that the interventions increased earnings, when they actually decreased or had no effects on earnings. Based on this experience, our goal is to caution practitioners that PSM can generate incorrect estimates, even under seemingly ideal circumstances.

In this case, the attempt to compare program participants to similar non-participants using observable characteristics (which is what method (b) relied on) implied that the participants earned much more than non-participants; however, comparing lotteried-in to lotteried-out people showed no such thing. This implies that there were unobservable ways in which participants differed from non-participants, ways that were significant enough to create the illusion of a strong program effect.
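A rough sketch of how matching on observables can mislead when an unobserved trait matters is given below. It is not the authors' method and does not use their data: it uses exact matching (stratification) on a single observed covariate as a simplified stand-in for propensity-score matching, and the `education` and `motivation` variables, the earnings formula, and all parameters are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def stratified_estimate(n=40_000, true_effect=0.0, seed=2):
    """Estimate a program effect by comparing participants with
    non-participants inside strata of one observed covariate.

    Hypothetical model: 'education' (observed, 0-3) and 'motivation'
    (unobserved) both raise earnings and both raise enrollment.
    """
    rng = random.Random(seed)
    strata = defaultdict(lambda: {True: [], False: []})
    for _ in range(n):
        education = rng.randrange(4)            # observed covariate
        motivation = rng.gauss(0, 1)            # unobserved covariate
        enrolled = (0.5 * (education - 1.5) + motivation
                    + rng.gauss(0, 1)) > 0
        earnings = (10_000 + 1_000 * education + 2_000 * motivation
                    + rng.gauss(0, 2_000))
        if enrolled:
            earnings += true_effect             # zero by default
        strata[education][enrolled].append(earnings)

    # Average the within-stratum gaps, weighted by number of participants.
    total_weight = 0
    weighted_gap = 0.0
    for groups in strata.values():
        gap = statistics.mean(groups[True]) - statistics.mean(groups[False])
        weighted_gap += gap * len(groups[True])
        total_weight += len(groups[True])
    return weighted_gap / total_weight


print(f"Matched estimate with a true effect of zero: "
      f"${stratified_estimate():,.0f}")
```

Stratifying removes the part of the gap explained by the observed covariate, but in this sketch the unobserved trait still leaves a large positive estimate for a program that does nothing.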

Studies on selection bias

We conducted a search for literature reviews of studies directly comparing the results of randomized and non-randomized estimates of social programs' effects. The most complete and recent literature reviews are summarized here.

Our overall take on these studies is that they (a) focus on the best-designed non-randomized studies; (b) show mixed results, and give substantial reason for concern that non-randomized studies' results can diverge significantly from randomized studies' results.

They do not show that selection bias systematically skews results in one direction or another; they do show that the presence of selection bias introduces a substantial source of skew. We believe that in the case of programs run by non-profits this skew is likely to be positive more often than negative.

Review 1: Glazerman, Levy, and Myers (2003). This review examines twelve studies "in the context of welfare, job training, and employment services programs."6 Each of the studies estimates a program’s impact by using a randomized controlled trial, and separately estimates the impact by using one or more nonrandomized methods.7 Each of the programs aimed to raise earnings.8

“Four studies concluded that NX [nonrandomized] methods performed well, four found evidence that some NX methods performed well while others did not, and four found that NX methods did not perform well or that there was insufficient evidence that they did perform well.”9
Aggregate analysis of the studies implied that the average effect found by nonrandomized studies differed by over $1,000 compared to the effect found by randomized studies. This was "about 10 percent of annual earnings for a typical population of disadvantaged workers."10
In the concluding section, the authors pose the question, "Can NX [nonrandomized] methods approximate the results from a well-designed and well-executed experiment?" Their answer is: "Occasionally, but many NX [nonrandomized] estimators produced results dramatically different from the experimental benchmark."11

Review 2: Bloom, Michalopoulos, and Hill (2005) reviews the question of randomized vs. nonrandomized evaluation in a variety of sectors.

Employment/earnings-related: several of the studies examined overlap with the studies discussed above. The exceptions to this overlap are:
- Friedlander and Robins (1995), which uses data from a series of large-scale studies of welfare-to-work programs in four states12 and concludes that "estimates of program effects from cross-state comparisons can be quite far from the true effects, even when samples are drawn (as ours were) with the same sample intake procedures and from target populations defined with the same objective characteristics."13
- An original comparison in Bloom, Michalopoulos, and Hill (2005) using a "six-state, seven-site evaluation that investigated different program approaches to moving welfare recipients to work."14 The authors conclude that a nonrandomized evaluation using several comparison groups might be able to match a randomized study's impact estimate and precision, but that "with respect to what methods could replace random assignment, we conclude that there are probably none that work well enough in a single replication, because the magnitude of the mismatch bias for any given nonexperimental evaluation can be large."15
Education. The review discusses two within-study comparisons of randomized and nonrandomized estimates of the impacts of school programs, one aiming to prevent dropout and the other reducing class size.16 Both studies conclude that nonrandomized methods have a high risk of leading to misleading conclusions about impacts; the second study specifically states that "[in] 35 to 45 percent of the 11 cases … [nonrandomized methods] would have led to the 'wrong decision,' i.e., a decision about whether to invest which was different from the decision based on the experimental [randomized] estimates."17
Other. The review also discusses a meta-analysis (Lipsey and Wilson 1993) of other evaluations of psychological, education, and behavior programs. This study states that "In some treatment areas ... nonrandom designs (relative to random) tend to strongly underestimate effects, and in others, they tend to strongly overestimate effects. The distribution of differences on methodological quality ratings shows a similar pattern."18

Review 3: Cook, Shadish, and Wong (2008) analyzes twelve within-study comparisons of randomized and nonrandomized methods.19 These twelve comparisons are from ten publications, and span a variety of social programs, mainly from the US. The publications are not included in Glazerman, Levy, and Myers (2003), and only one is included in Bloom, Michalopoulos, and Hill (2005).

Cook, Shadish, and Wong (2008) finds that in two of the comparisons, the nonrandomized method sometimes achieves the same result as the randomized method and sometimes does not; in two other comparisons the nonrandomized methods fail to achieve the same result; and in the other eight comparisons the nonrandomized methods replicate the randomized methods reasonably.20 The review concludes that "the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within-study comparison literature."21 However, the review explicitly concentrates on the "best possible [nonrandomized] design and analysis practice,"22 and also states:

This review showed that use of “off-the-shelf” (mostly demographic) covariates consistently failed to reproduce the results of experiments. Yet such variables are often all that is available from survey data. They are also, alas, all that some analysts think they need. The failure of “off-the-shelf” covariates indicts much current causal practice in the social sciences, where researchers finish up doing propensity score and OLS analyses of what are poor quasi-experimental designs and impoverished sets of covariates. Such designs and analyses do not undermine OLS or propensity scores per se—only poor practice in using these tools. Nonetheless, we are skeptical that statistical adjustments can play a useful role when population differences are large and quasi-experimental designs are weak … This means we are skeptical about much current practice in sociology, economics, and political science.23

Publication bias

What is publication bias?

"Publication bias" is a broad term for factors that systematically bias final, published results in the direction that the researchers and publishers (consciously or unconsciously) wish them to point.

Interpreting and presenting data usually involves a substantial degree of judgment on the part of the researcher; consciously or unconsciously, a researcher may present data in the most favorable light for his/her point of view. In addition, studies whose final conclusions aren't what the researcher (or the study funder) hoped for may be less likely to be made public.
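The effect of selective publication can be sketched with a simulation. Everything here is an assumption made for illustration: the program has a true effect of zero, each study has 50 people per arm, and the hypothetical publication rule is that only positive results with a z statistic above 1.96 are made public.

```python
import random
import statistics


def run_study(rng, n_per_arm=50, true_effect=0.0):
    """Return (estimated effect, z statistic) for one simulated study."""
    treated = [rng.gauss(true_effect, 1) for _ in range(n_per_arm)]
    control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    estimate = statistics.mean(treated) - statistics.mean(control)
    std_err = (statistics.variance(treated) / n_per_arm
               + statistics.variance(control) / n_per_arm) ** 0.5
    return estimate, estimate / std_err


def simulate_literature(n_studies=2_000, seed=3):
    """Hypothetical rule: only positive results with z > 1.96 are published."""
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate, z = run_study(rng)
        all_estimates.append(estimate)
        if z > 1.96:
            published.append(estimate)
    return (statistics.mean(all_estimates),
            statistics.mean(published),
            len(published))


all_mean, published_mean, n_published = simulate_literature()
print(f"Mean effect across all studies:  {all_mean:.3f}")
print(f"Mean effect in published studies: {published_mean:.3f} "
      f"({n_published} studies)")
```

Under this rule, a reader who sees only the published studies would see a consistently positive effect, while the full set of studies averages out to roughly nothing.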

What sort of skew is publication bias likely to cause?

As discussed below, the existing literature on publication bias often concludes that studies are skewed toward showing (a) more "surprising" findings; and (b) more "positive" findings (indicating that medical treatments, social policies, etc. "work").

We have not identified any studies specifically on publication bias in evaluations of non-profit programs, but we would guess that these studies would be skewed to the optimistic side, simply because the non-profits cooperating in the studies and the funders paying for them have incentives to portray their work in a positive light, and we know of no study funders or implementers with incentives to skew results in the pessimistic direction.

Publication bias in low- vs. high-quality studies

We are less concerned about publication bias in studies that have the following qualities, in descending order of importance:

Registration. ClinicalTrials.gov is an example of a registry where researchers post the design, methodology, and hypothesis for each study before data is actually collected. In our view, this makes researchers accountable to public scrutiny if results are later buried or interpreted in a skewed way.
Randomized design. Above, we discuss the design of a randomized controlled trial (RCT), a study in which a lottery determines who is and isn't enrolled in a program. We agree with Esther Duflo's argument that a study with this sort of design is less susceptible to publication bias:24

Publication bias is likely to be a particular problem with retrospective studies. Ex post the researchers or evaluators define their own comparison group, and thus may be able to pick a variety of plausible comparison groups; in particular, researchers obtaining negative results with retrospective techniques are likely to try different approaches, or not to publish … In contrast, randomized evaluations commit in advance to a particular comparison group: once the work is done to conduct a prospective randomized evaluation the results are usually documented and published even if the results suggest quite modest effects or even no effects at all.

High expense. In our view, a study that is very expensive to carry out is likely to be published regardless of what it shows and how favorable its findings are to the researchers' hopes. The presentation of the data may still be skewed, but the threat that the study is "buried" seems smaller.

We have not seen systematic investigations of the hypotheses laid out above.

Studies on publication bias

We have not yet conducted a systematic review of literature on publication bias, but we have come across several studies on the subject.

Medicine. Hopewell et al. (2009) reviewed five studies examining patterns in which clinical trials did and didn't have their results published in medical literature:

These studies showed that trials with positive findings … or those findings perceived to be important or striking, or those indicating a positive direction of treatment effect), had nearly four times the odds of being published compared to findings that were not statistically significant … or perceived as unimportant, or showing a negative or null direction of treatment effect.
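To give a sense of what "four times the odds" can mean in terms of probabilities, here is a small worked conversion. Only the odds ratio of roughly four comes from the quoted passage; the baseline publication probabilities below are hypothetical and are not taken from Hopewell et al. (2009).

```python
def publication_probability(baseline_probability, odds_ratio):
    """Convert a baseline probability and an odds ratio into a probability.

    odds = p / (1 - p); multiplying the odds by the odds ratio and
    converting back gives the implied probability for the other group.
    """
    baseline_odds = baseline_probability / (1 - baseline_probability)
    odds = baseline_odds * odds_ratio
    return odds / (1 + odds)


# Hypothetical baselines: chance that a null or negative trial is published.
for baseline in (0.2, 0.5):
    implied = publication_probability(baseline, odds_ratio=4)
    print(f"If {baseline:.0%} of null results are published, "
          f"an odds ratio of 4 implies {implied:.0%} of positive results are.")
```

The point of the conversion is that an odds ratio does not translate into a fixed probability gap; the implied difference depends on the baseline.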

Ioannidis (2005a and 2005b) explored the magnitude of the problem and concluded that from both a theoretical and empirical perspective, there is reason to be skeptical of many (even most) of the conclusions published in medical literature.25 These studies also provide some loose arguments that studies with less flexibility, particularly randomized controlled trials, are likely to be less susceptible to these issues.26

Economics. De Long and Lang (1992) gives some evidence for a broad form of publication bias in the field of economics. It examines published papers that fail to reject their central "null hypothesis" (the "null hypothesis" generally referring to a "general or default position, such as that there is no relationship between two measured phenomena"27) and finds that an aggregate analysis of these papers' results suggests that the individual results are erroneous - i.e., most or all of the central "null hypotheses" that the papers fail to reject are in fact false. It concludes that the best explanation for this phenomenon involves publication bias: papers rejecting their central "null hypothesis" are not published without prejudice, but rather published largely (or only) when their rejection is "exciting."28

Publication bias in more narrow topics.

Donohue and Wolfers (2006) presents evidence that papers on the deterrent effect of the death penalty seem skewed toward publishing positive results (i.e., results that imply a real deterrent effect).29
The Campbell Collaboration frequently tests for publication bias in studies on specific interventions such as volunteer tutoring programs30 and programs seeking to improve parental involvement in children's academics.31 In both of these cases, no evidence for publication bias was found. We intend to investigate the Campbell Library more thoroughly in the future for better context on the risks of publication bias.

Suggestive evidence on the combined effects of selection bias and publication bias: the cases of microlending and Head Start

If publication bias is a real and significant problem, this could be expected to imply that studies of social programs will tend to exaggerate the programs' impact - especially studies that are prone to selection bias and otherwise leave significant room for judgment calls on the part of the researchers. This idea is similar to Rossi (1987)'s "Stainless Steel Law of Evaluation," which is the proposition that:32

The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.

This law means that the more technically rigorous the net impact assessment, the more likely are its results to be zero—or no effect. Specifically, this law implies that estimating net impacts through randomized controlled experiments, the avowedly best approach to estimating net impacts, is more likely to show zero effects than other less rigorous approaches.

We have encountered no formal empirical study of this "law."33 We believe it to be valid based partly on our own experience of reviewing the highest-quality academic literature we can find compared with our experience of reviewing evaluations submitted by non-profits; we intend to document this comparison more systematically in the future.

Here, we discuss two cases that we believe provide suggestive evidence for the above proposition: microlending and Head Start. In both of these cases, we are able to compare a systematic overview of relatively low-quality studies (i.e., highly prone to selection bias, and with substantial room for judgment in their construction) to later evidence from randomized controlled trials. In both of these cases, the earlier, lower-quality research presents a much more optimistic picture than the randomized controlled trials.

Microlending, the practice of making small loans to low-income people (generally in the developing world), was the subject of many impact studies prior to 2005. These studies were collected and discussed in a 2005 literature review.34 This review concluded that the evidence for microfinance's impact was strong, and implied that randomized controlled trials could be expected to demonstrate impact as well.35

However, to date the results from the two randomized controlled trials on microlending have been far less encouraging:

Banerjee et al. (2009) conducted a randomized controlled trial of a microlending program in India and concluded that "15 to 18 months after the program, there was no effect of access to microcredit on average monthly expenditure per capita, but durable expenditure did increase … We find no impact on measures of health, education, or women's decision-making."36
A more recent study in rural Morocco found similar results, seeing different effects on different borrowers but no aggregate effect on measures of well-being.37

Head Start. A 2001 review examined studies on Head Start, a federal early childhood care program in the U.S., and found overwhelmingly positive, long-term effects on measures including achievement test scores and grade and school completion, while acknowledging the lack of a truly high-quality randomized study.38 In 2010, the first results from a very large, high-quality study became available and were far less encouraging.39

Bottom line

Speaking intuitively, we feel that the combination of selection bias and publication bias will cause most studies of non-profits' programs to exaggerate the case for optimism. We focus on studies that we think are less prone to these two biases, and believe that the randomized controlled trial (RCT) design is one way (though not the only way) of mitigating these issues. We believe that higher-quality studies are likely to give a less positive picture of non-profit effectiveness than lower-quality studies.

Sources

Agodini, Roberto and Mark Dynarski. 2004. Are experiments the only option? A look at dropout prevention programs. The Review of Economics and Statistics 86(1): 180-194.
Banerjee, Abhijit, et al. 2009. The miracle of microfinance? Evidence from a randomized evaluation (PDF).
Bloom, Michalopoulos, and Hill. 2005. Using experiments to assess nonexperimental comparison-group methods for measuring program effects. In Learning More from Social Experiments, ed. Howard S. Bloom, 173-236. New York: Russell Sage Foundation.
ClinicalTrials.gov. Homepage. http://www.clinicaltrials.gov/ (accessed November 24, 2010). Archived by WebCite® at http://www.webcitation.org/5uUT3tx5V.
Cook, Thomas D., William R. Shadish, and Vivian C. Wong. 2008. Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons (PDF). Journal of Policy Analysis and Management 27(4): 724-750.
Currie, Janet. 2001. Early childhood education programs (PDF). Journal of Economic Perspectives 15(2): 213-238.
De Long, J. Bradford and Kevin Lang. 1992. Are all economic hypotheses false? (PDF). Journal of Political Economy 100(6): 1257-72.
Donohue, John J. and Justin Wolfers. 2006. Uses and abuses of empirical evidence in the death penalty debate. Stanford Law Review 58: 791-846. Abstract available at http://www.nber.org/papers/w11982 (accessed November 29, 2010). Archived by WebCite® at http://www.webcitation.org/5ubSlMHqE.
Duflo, Esther and Michael Kremer. 2003. Use of randomization in the evaluation of development effectiveness (PDF). In Conference on Evaluation and Development Effectiveness, Washington DC, 2003. Washington DC: World Bank Operations Evaluation Department.
Friedlander, Daniel and Philip K. Robins. 1995. Evaluating program evaluations: New evidence on commonly used nonexperimental methods. American Economic Review 85(4): 923-937.
GiveWell Blog. High-quality study of Head Start early childhood care program.
Glazerman, Steven, Dan M. Levy, and David Myers. 2003. Nonexperimental versus experimental estimates of earnings impacts. Annals of the American Academy of Political and Social Science 589(1): 63-93. Abstract available at http://ann.sagepub.com/content/589/1/63.short (accessed November 29, 2010). Archived by WebCite® at http://www.webcitation.org/5ubSEV2OZ.
Goldberg, Nathanael. 2005. Measuring the impact of microfinance: Taking stock of what we know (PDF). Washington DC: Grameen Foundation USA.
Hopewell, S., et al. 2009. Publication bias in clinical trials due to statistical significance or direction of trial results. Cochrane Database of Systematic Reviews 2009, Issue 1. Summary available at http://www2.cochrane.org/reviews/en/mr000006.html (accessed on November 24, 2010). Archived by WebCite® at http://www.webcitation.org/5uUR0dH36.
Ioannidis 2005a. Contradicted and initially stronger effects in highly cited clinical research (PDF). JAMA 294(2): 218-228.
Ioannidis 2005b. Why most published research findings are false (PDF). PLoS Medicine 2(8): e124.
Peikes, Deborah N., Lorenzo Moreno, and Sean Michael Orzol. 2008. Propensity score matching: A note of caution for evaluators of social programs. American Statistician 62(3): 222-231. Abstract available at http://pubs.amstat.org/doi/abs/10.1198/000313008X332016 (accessed November 29, 2010). Archived by WebCite® at http://www.webcitation.org/5ubSOccyz.
Poverty Action Lab. Methodology: Overview. http://www.povertyactionlab.org/methodology (accessed November 24, 2010). Archived by WebCite® at http://www.webcitation.org/5uUL7mgH8.
Roodman, David. Rossi's rules (accessed October 19, 2010). David Roodman's Microfinance Open Book Blog, July 13, 2009. Archived by WebCite® at http://www.webcitation.org/5tfUdibYt.
Rossi, Peter H. 1987. The iron law of evaluation and other metallic rules. Research in Social Problems and Public Policy 4: 3–20.
Starita, Laura. Microfinance impact and innovation: Microfinance impacts (accessed November 24, 2010). Philanthropy Action News & Commentary, October 21, 2010. Archived by WebCite® at http://www.webcitation.org/5uUS9HHEn.
Wilde, Elizabeth Ty and Robinson Hollister. 2002. How close is close enough? Testing nonexperimental estimates of impact against experimental estimates of impact with education test scores as outcomes (PDF). Institute for Research on Poverty Discussion Paper no. 1242-02.
Wikipedia. Null hypothesis. http://en.wikipedia.org/wiki/Null_hypothesis (accessed November 24, 2010). Archived by WebCite® at http://www.webcitation.org/5uUS1Wa39.
U.S. Department of Health and Human Services, Administration for Children and Families. Head Start impact study technical report (2010) (PDF).

1
For definition see Poverty Action Lab, "Methodology: Overview."
2
"Over the past 25 years, evaluators of social programs have searched for nonexperimental methods that can substitute effectively for experimental ones. Recently, the spotlight has focused on one method, propensity score matching (PSM), as the suggested approach for evaluating employment and education programs." Peikes, Moreno, and Orzol 2008, Pg 222.
3
Peikes, Moreno, and Orzol 2008, Pg 222.
4
- New York: "Notably, the comparison group approach incorrectly estimated impacts on earnings. For example, the PSM approach estimated statistically significant positive impacts of between $1,000 and $1,200, whereas the experimental estimates were statistically significant and negative for the benefits counseling and waivers group (−$1,080 or −$1,161, depending on the model specification) and −$367 or −$455 and not statistically significant for the benefits counseling, waivers, and employment services group." Peikes, Moreno, and Orzol 2008, Pg 227.
- New Hampshire: "Results for the change in earnings between the year after randomization and the average over the two years before randomization were especially worrisome. We found no effect on the change in earnings when using random assignment: −$597 (p = 0.530) for SSI-concurrent participants and −$512 (p = 0.670) for SSDI-only participants, but we found a very large positive and statistically significant effect when using the propensity score comparison groups: $5,620 (p = 0.001) and $2,166 (p = 0.047), respectively, for SSI-concurrent and SSDI-only participants. As with the New York example, the inferences we would make about the State Partnership Initiative in New Hampshire would be incorrect if we relied on PSM." Peikes, Moreno, and Orzol 2008, Pg 229.
- Oklahoma: "Turning to impacts on earnings, the PSM comparison group approach generates an estimate with the wrong sign, but because both the PSM and random assignment estimates are not statistically significant, both methods lead to a similar conclusion that there was no impact." Peikes, Moreno, and Orzol 2008, Pg 229.
5
Peikes, Moreno, and Orzol 2008, Pgs 222-223.
6
"To assess nonexperimental (NX) evaluation methods in the context of welfare, job training, and employment services programs, the authors reexamined the results of twelve case studies." Glazerman, Levy, and Myers 2003, Pg 63.
7
"The authors reexamined the results of twelve case studies intended to replicate impact estimates from an experimental evaluation by using NX methods." Glazerman, Levy, and Myers 2003, Pg 63.
8
"To be included in the review, a study had to meet the following criteria…. The intervention’s purpose was to raise participants’ earnings. This criterion restricts our focus to programs that provide job training and employment services…. All of the interventions involved job training or employment services” Glazerman, Levy, and Myers 2003, Pg 68.
9
Glazerman, Levy, and Myers 2003, Pg 74.
10
"The average of the absolute bias over all studies was more than $1,000, which is about 10 percent of annual earnings for a typical population of disadvantaged workers." Glazerman, Levy, and Myers 2003, Pg 86.
11
Glazerman, Levy, and Myers 2003, Pg 86.
12
"Within-Study Comparisons of Impact Estimates for Employment and Training Programs.... Daniel Friedlander and Philip K. Robins (1995) benchmarked nonexperimental methods using data from a series of large-scale random-assignment studies of mandatory welfare-to-work programs in four states." Bloom, Michalopoulos, and Hill 2005, Pg 180, 186; italics in the original.
13
Friedlander and Robins 1995, Pg 935.
14
"The remainder of the chapter measures the selection bias resulting from nonexperimental comparison-group methods by benchmarking them against the randomized experiments that made up the National Evaluation of Welfare-to-Work Strategies (NEWWS), a six-state, seven-site evaluation that investigated different program approaches to moving welfare recipients to work." Bloom, Michalopoulos, and Hill 2005, Pg 194.
15
"With respect to what methods could replace random assignment, we conclude that there are probably none that work well enough in a single replication, because the magnitude of the mismatch bias for any given nonexperimental evaluation can be large. This added error component markedly reduces the likelihood that nonexperimental comparison-group methods could replicate major findings from randomized experiments such as NEWWS. Arguably more problematic is the fact that it is not possible to account for mismatch error through statistical tests or confidence intervals when nonexperimental comparison group methods are used.

Our results offer one ray of hope regarding nonexperimental methods. Although nonexperimental mismatch error can be quite large, it varies unpredictably across evaluations and has an apparent grand mean of 0. A nonexperimental evaluation that used several comparison groups might therefore be able to match a randomized experiment's impact estimate and statistical precision. It is important to recognize, however, that this claim rests on an empirical analysis that might not be generalizable to other settings.... It is possible that comparison-group approaches can be used to construct valid counterfactuals for certain types of programs and certain types of data. Considered in conjunction with related research exploring nonexperimental comparison-group methods, however, the findings presented here suggests that such methods, regardless of their technical sophistication, are no substitute for randomized experiments in measuring the impacts of social and education programs. Thus, we believe that before nonexperimental comparison-group approaches can be accepted as the basis for major policy evaluations, their efficacy needs to be demonstrated by those who would rely on them." Bloom, Michalopoulos, and Hill 2005, Pgs 224-225.
16
"Within-Study Comparisons of Impact Estimates for Education Programs.... The School Dropout Prevention Experiment Roberto Agodini and Mark Dynarski (2004) compared experimental estimates of impacts for dropout prevention programs in eight middle schools and eight high schools with alternative nonexperimental estimates.... Using extensive baseline data, the authors tested propensity-score matching methods, standard OLS regression models, and fixed-effects models.... The Tennessee Class-Size Experiment Elizabeth Ty Wilde and Robinson Hollister (2002) compared experimental and nonexperimental estimates of the impacts on student achievement of reducing class size.... Wilde and Hollister (2002) used propensity-score methods to find matches for the school's program-group students in the pooled sample of control-group students in the other ten schools. The authors also compared experimental impact estimates with nonexperimental estimates obtained from OLS regression methods without propensity-score matching." Bloom, Michalopoulos, and Hill 2005, Pgs 190-191; italics in the original.
17
- "We find no consistent evidence that propensity-score methods replicate experimental impacts of the dropout prevention programs funded by the SDDAP. In fact, we find that evaluating these programs using propensity-score methods might have led to misleading inferences about the effectiveness of the programs.... We also find that impacts based on regression methods, which are easier to implement, are not any more capable of replicating experimental impacts in this setting than are propensity-score methods." Agodini and Dynarski 2004, Pg 192.
- "We found that in most cases, the propensity-score estimate of the impact differed substantially from the 'true impact' estimated by the experiment. We then attempted to address the question, “How close are the nonexperimental estimates to the experimental ones?” We suggested several different ways of attempting to assess “closeness.” Most of them led to the conclusion, in our view, that the nonexperimental estimates were not very “close” and therefore were not reliable guides as to the “true impact.” We put greatest emphasis on looking at this question of “how close is close enough?” in terms of a decision-maker trying to use the evaluation to determine whether to invest in wider application of the intervention being assessed—in this case reduction in class size. We illustrated this in terms of a rough cost-benefit framework for small class size as applied to Project Star. We found that in 35 to 45 percent of the 11 cases where we had used propensity score matching for the nonexperimental estimate, it would have led to the 'wrong decision,' i.e., a decision about whether to invest which was different from the decision based on the experimental estimates.... The propensity-score-matched
  estimator did not perform notably better than the multiple-regression-corrected estimators for most of the 11 cases.... None of these multiple-regression-correction nonexperimental estimation methods appeared to perform very well where the performance criterion was how close their impact estimate was to the 'true impact' experimental estimate obtained from the random-assignment design.... our conclusion is that nonexperimental estimators do not perform very well when judged by standards of 'how close' they are to the 'true impacts' estimated from experimental estimators based on a random-assignment design." Wilde and Hollister 2002, Pgs 32-34.
18
"A number of meta-analyses ... have compared summaries of findings based on experimental studies with summaries based on nonexperimental studies.... The most extensive such comparison was a meta-analysis of meta-analyses in which Mark W. Lipsey and David B. Wilson (1993) synthesized earlier research on the effectiveness of psychological, education, and behavior treatments. In part of their analysis, they compared the means and standard deviations of experimental and nonexperimental impact estimates from seventy-four meta-analyses for which findings from both types of studies were available. Representing hundreds of primary studies, this comparison revealed little difference between the mean effect estimated on the basis of experimental studies.... Lipsey and Wilson (1993, 1193) concluded:
'These various comparisons do not indicate that it makes no difference to the validity of treatment effect estimates if a primary study uses random versus nonrandom assignment. What these comparisons do indicate is that there is no strong pattern or bias in the direction of the difference made by lower quality methods.... In some treatment areas, therefore, nonrandom designs (relative to random) tend to strongly underestimate effects, and in others, they tend to strongly overestimate effects.'" Bloom, Michalopoulos, and Hill 2005, Pgs 192-193.
19
"This paper analyzes 12 recent within-study comparisons contrasting causal estimates from a randomized experiment with those from an observational study sharing the same treatment group." Cook, Shadish, and Wong 2008, Pg 724.
20
"We identify three studies comparing experiments and regression-discontinuity (RD) studies. They produce quite comparable causal estimates at points around the RD cutoff. We identify three other studies where the quasi-experiment involves careful intact group matching on the pretest. Despite the logical possibility of hidden bias in this instance, all three cases also reproduce their experimental estimates, especially if the match is geographically local. We then identify two studies where the treatment and nonrandomized comparison groups manifestly differ at pretest but where the selection process into treatment is completely or very plausibly known. Here too, experimental results are recreated. Two of the remaining studies result in correspondent experimental and nonexperimental results under some circumstances but not others, while two others produce different experimental and nonexperimental estimates, though in each case the observational study was poorly designed and analyzed. Such evidence is more promising than what was achieved in past within-study comparisons, most involving job training." Cook, Shadish, and Wong 2008, Pg 724.

"Eight of the comparisons produced observational study results that are reasonably close to those of their yoked experiment, and two obtained a close correspondence in some analyses but not others. Only two studies claimed different findings in the experiment and observational study, each involving a particularly weak observational study. Taken as a whole, then, the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within-study comparison literature.... RD [regression-discontinuity] is one type of nonequivalent group design, and three studies showed that it produced generally the same causal estimates as experiments.... The basic conclusion, though, is that RD estimates are valid if they result from analyses sensitive to the method’s main assumptions. We can also trust estimates from observational studies that match intact treatment and comparison groups on at least pretest measures of outcome." Cook, Shadish, and Wong 2008, Pg 745.
21
Cook, Shadish, and Wong 2008, Pg 745.
22
"In the job training work, the quasi-experimental design structures were heterogeneous in form and underexplicated relative to the emphasis the researchers placed on statistical models and analytic details. It is as though the studies’ main purpose was to test the adequacy of whatever nonexperimental statistical practice for selection bias adjustment seemed current in job training at the time. This is quite different from trying to test best possible quasi-experimental design and analysis practice, as we have done here." Cook, Shadish, and Wong 2008, Pg 748.
23
Cook, Shadish, and Wong 2008, Pg 746.
24
Duflo and Kremer 2003, Pg 24.
25
- Ioannidis 2005a: "All original clinical research studies published in 3 major general clinical journals or high-impact-factor specialty journals in 1990-2003 and cited more than 1000 times in the literature were examined … Of 49 highly cited original clinical research studies, 45 claimed that the intervention was effective. Of these, 7 (16%) were contradicted by subsequent studies, 7 others (16%) had found effects that were stronger than those of subsequent studies, 20 (44%) were replicated, and 11 (24%) remain largely unchallenged" (abstract).
- Ioannidis 2005b: "Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias." (abstract)
26
- Ioannidis 2005a: "Five of 6 highly cited nonrandomized studies had been contradicted or had found stronger effects [than later results] vs. 9 of 39 randomized controlled trials." (abstract).
- Ioannidis 2005b: "In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes …" (abstract)
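The direction of the factors listed in the second quote can be illustrated with a plain Bayes-rule calculation. The sketch below is not taken from the quoted abstracts: the function name `ppv`, the parameter names, and the example numbers are all hypothetical, and it ignores the bias term that the fuller framework includes. It computes the share of "significant" findings that reflect a real effect, given the pre-study odds that a tested relationship is real, the study's power, and the significance threshold.

```python
def ppv(prior_odds, power, alpha=0.05):
    """Share of statistically significant findings that are true.

    prior_odds: ratio of real to null relationships among those tested
                (hypothetical input; lower means less preselection).
    power:      probability that a study detects a real effect
                (smaller studies and smaller effects mean lower power).
    alpha:      false-positive rate for null relationships.
    """
    true_positives = power * prior_odds   # real effects that reach significance
    false_positives = alpha               # null effects that reach significance
    return true_positives / (true_positives + false_positives)


# Illustrative (made-up) settings, not estimates for any real field.
well_powered_confirmatory = ppv(prior_odds=1.0, power=0.8)
small_exploratory = ppv(prior_odds=0.1, power=0.2)

print(f"well-powered, preselected hypothesis: {well_powered_confirmatory:.2f}")
print(f"small study, many candidate relationships: {small_exploratory:.2f}")
```

Under these assumed inputs, the small exploratory setting yields findings that are more likely false than true, while the well-powered setting does not.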
27
Phrasing from Wikipedia.
28
"Very low t-statistics appear to be systematically absent--and therefore null hypotheses are overwhelmingly false - only when the universe of null hypotheses considered are the central themes of published economics articles.

This suggests, to us, a publication-bias explanation of our finding. What makes a journal editor choose to publish an article which fails to reject its central null hypothesis, which produces a value of É(a) > 0.1 for its central hypothesis test? The paper must excite the editor's interest along some dimension, and it seems to us that the most likely dimension is that the paper is in apparent contradiction to earlier work on the same topic: either others working along the same line have in the past rejected the same null, or because theory or conventional wisdom suggests a significant relation." De Long and Lang 1992, Pg 13-14.
29
"Fortunately, we can test for reporting bias. The intuition for this test begins by noting that different approaches to estimating the effect of executions on the homicide rate should yield estimates that are somewhat similar. That said, some approaches yield estimates with small standard errors, and hence these should be tightly clustered around the same estimate, while other approaches yield larger standard errors, and hence the estimated effects might be more variable. Thus, there is likely to be a relationship between the size of the standard error and the variability of the estimates, but on average there should be no relationship between the standard error and the estimated effect. By implication, if there is a correlation between the size of the estimate and its standard error, this finding suggests that reported estimates comprise an unrepresentative sample. One simple possibility might be that researchers are particularly likely to report statistically significant results, and thus they only report on estimates that have large standard errors if the estimated effect is also large. If this were true, we would be particularly likely to observe estimates that are at least twice as large as the standard error, and therefore coefficient estimates would be positively correlated with the standard error … the reported estimates appear to be strongly correlated with their standard errors: we find a correlation coefficient of 0.88, which is both large and statistically significant. Second, among studies with designs that yielded large standard errors, only large positive effects are reported, despite the fact that such designs should be more likely to also yield small effects or even large negative effects. And third, we observe very few estimates with t-statistics smaller than two, despite the fact that the estimated deterrent effect required to meet this burden rises with the standard error.

Moreover, while Figure 9 focuses only on the central estimate from each study, Figure 10 shows the pattern of estimated coefficients and standard errors reported within each study. Typically these various estimates reflect an author’s attempt to assess the robustness of the preferred result to an array of alternative specifications. Yet within each of these studies (except Katz, Levitt, and Shustorovich) we find a statistically significant correlation between the standard error of the estimate and its coefficient, which runs counter to one’s expectations from a true sensitivity analysis." Donohue and Wolfers 2006, Pg 839-840.
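The logic of the test quoted above can be sketched with a small simulation. This is only an illustration under stated assumptions, not the authors' code or data: the true effect is set to zero, standard errors are drawn uniformly from an arbitrary range, and "selective reporting" is modeled as reporting only estimates at least twice their standard error. The function names `pearson` and `simulate` are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n_studies, true_effect, selective, seed=0):
    """Return (estimates, standard_errors) for the studies that get reported."""
    rng = random.Random(seed)
    estimates, std_errors = [], []
    for _ in range(n_studies):
        se = rng.uniform(0.5, 5.0)          # assumed spread of study precision
        est = rng.gauss(true_effect, se)    # noisy estimate of the true effect
        if selective and est / se < 2.0:    # assumed rule: report only t >= 2
            continue
        estimates.append(est)
        std_errors.append(se)
    return estimates, std_errors


all_est, all_se = simulate(20000, 0.0, selective=False)
rep_est, rep_se = simulate(20000, 0.0, selective=True)

corr_all = pearson(all_est, all_se)
corr_reported = pearson(rep_est, rep_se)

print(f"correlation, all estimates:      {corr_all:.2f}")
print(f"correlation, reported estimates: {corr_reported:.2f}")
```

When every estimate is reported, the estimate and its standard error are essentially uncorrelated; when only "significant" estimates are reported, imprecise studies appear only with large effects, which produces the strong positive correlation the quote describes.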
30
See our review.
31
See our review.
32
Rossi 1987. Excerpted in Roodman.
33
Note that we do discuss direct comparisons of randomized to nonrandomized studies above. However, in these comparisons, the nonrandomized studies are constructed purely for the purpose of comparison to randomized studies, i.e., for methodological reasons and not investigative ones. Therefore, they are not truly "evaluations" of the social programs in question and are not prone to the same concerns about publication bias that evaluations would be.
34
Goldberg 2005. Also see our 2008 review of these studies expressing concerns about selection bias.
35
"It would be hard to read through all of the many positive findings in these dozens of studies - noting how rarely the comparison groups showed better outcomes than clients - and not feel that microfinance is an effective tool for poverty eradication.

On the other hand, considering all the ways we have seen in which subtle differences between clients and comparisons groups can affect the conclusions we draw, the evidence, as convincing as it is, is not quite good enough. It will be an enormous benefit to the entire industry when the first 'incontrovertible' study is published. The only way to achieve this is through randomized control trials. Fortunately, the first of these studies is already underway. While the first use of randomized evaluations may be to prove the effectiveness of microfinance programs, MFI managers, as consumers of information, may soon start to demand randomized trials for informing their management decisions."
36
Banerjee et al. 2009, abstract.
37
"Esther Duflo presented the second set of new data of the morning, “fresh from the oven” in her words. Duflo’s study with the microfinance institution Al Amana took place in rural Morocco in areas previously unserved by formal financial institutions. In all, around 5000 households were captured in the study of the impact of a group liability microcredit product.

Since the people in the study would not have been exposed to formal financial services, the target method was expressly designed to offer services to a higher proportion of people who, based on assessments of baseline data, would be more likely to take-up loans. Despite these efforts and the heavy marketing of the bank, only 16 percent of those who were offered loans took them (interestingly, as with the Karlan data, a lot of the study participants lie in follow-on interviews about having taken a loan – why would they do that?)

So what was the impact on those credit recipients?

The study found no impact on household consumption.
The study found no improvements in welfare.
The study found no effect on the likelihood that a recipient would start a new business.
The study did not show an increased ability to deal with shocks.

The study did find for people who already had a business, however, that loan recipients were more likely to stop engaging in wage work and invest more in their businesses. Livestock owners were more likely to buy more livestock and of a different variety than they had previously owned (so cow farmers diversified with sheep, and vice-versa, creating a de facto savings). And agricultural business sales increased, they took on more employees and those employee wages went up. Non-agricultural businesses did not show the same positive effects and income, on average, did not increase, partly because increases in the household business were offset by the “substitution” effect of decreased wage work." Starita 2010. Note that we refer here to a summary of a conference presentation because the study itself is not yet published.
38
See Currie 2001, Pg 223, Table 2. Context: "there has never been a large-scale, randomized trial of a typical Head Start program, although plans for such a trial are now underway at the U.S. Department of Health and Human Services … Table 2 provides an overview of selected studies, focusing on those which are most recent and prominent and on those which have made especially careful attempts to control for other factors that might affect outcomes" (Currie 2001, Pg 222).
39
See GiveWell Blog, "High-quality study of Head Start early childhood care program," for a summary of the study, which is U.S. Department of Health and Human Services, Administration for Children and Families 2010.