Matching Method and the Gold Standard

Anna Egalite and Matthew Ackerman have a new study out that examines whether the matching methodology used by CREDO to evaluate charter schools is “a reasonable alternative when the gold standard is not feasible or possible.” They conclude that it is. Using data from Florida, they consider and rebut a series of common criticisms of the CREDO methodology.

They find that matching each charter student to multiple comparison students does not change results much relative to using a single match. They also find that matching on administrative classifications, like special education and English language learner status, does not distort results much even though those classifications are systematically different across sectors. And they find that more rigorous methodologies, like using exogenous instruments, yield results in Florida similar to those from CREDO’s matching method.
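For readers who want to see the basic mechanics, here is a minimal sketch of what matching on observed characteristics involves. It is written in Python and is purely illustrative, not CREDO’s actual procedure; the column names, the exact-match variables, and the single nearest-neighbor rule are all assumptions made for the sake of the example.

```python
# Illustrative sketch only -- NOT CREDO's actual algorithm.
# Column names (grade, frl, sped, ell, prior_score) are hypothetical.
import numpy as np
import pandas as pd

def match_on_observables(charter, tps, exact_cols, score_col):
    """For each charter student, find the traditional public school (TPS)
    student with identical values on the exact-match columns and the
    closest prior test score."""
    matches = []
    for _, c in charter.iterrows():
        pool = tps
        for col in exact_cols:                       # exact match on categorical flags
            pool = pool[pool[col] == c[col]]
        if pool.empty:
            continue                                 # no acceptable comparison student
        idx = (pool[score_col] - c[score_col]).abs().idxmin()  # nearest prior score
        matches.append((c["student_id"], pool.loc[idx, "student_id"]))
    return pd.DataFrame(matches, columns=["charter_id", "matched_tps_id"])

# Toy data to show the function running end to end.
rng = np.random.default_rng(0)
def toy(n, prefix):
    return pd.DataFrame({
        "student_id": [f"{prefix}{i}" for i in range(n)],
        "grade": rng.integers(3, 9, n),
        "frl": rng.integers(0, 2, n),
        "sped": rng.integers(0, 2, n),
        "ell": rng.integers(0, 2, n),
        "prior_score": rng.normal(0, 1, n),
    })

pairs = match_on_observables(toy(200, "C"), toy(2000, "T"),
                             exact_cols=["grade", "frl", "sped", "ell"],
                             score_col="prior_score")
print(pairs.head())
```

Once such pairs are formed, the typical next step is to compare later outcomes between the charter students and their matched comparisons; the credibility of that comparison rests entirely on the matched variables capturing everything relevant to selection.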

Anna and Matthew have done excellent work and convincingly demonstrated their case. Since Anna is a former student who is now an Assistant Professor at North Carolina State (via a post-doc at Harvard), and another former student of mine, James (Lynn) Woodworth, is a researcher at CREDO and an author of reports using this methodology, this superb analysis of CREDO’s approach fills me with pride in their accomplishments.

But I’m concerned that they or others may over-interpret what this study finds. It does not demonstrate that matching generally gives you the same result as randomized experiments or other gold-standard methodologies. All that it demonstrates is that matching yielded similar results in this particular context. In this circumstance, the selection of students into charter schools did not produce important differences between treatment and control students on unobserved characteristics. And in this case, systematic differences in how charter and traditional public schools classify students into special ed, ELL, and free lunch did not bias the result. But the next time we use a matching methodology, the situation could be completely different: the types of students who attend charters may differ in unobserved ways, and administrative classifications could produce strong bias.
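To make the worry concrete, here is a small simulation of my own with purely invented numbers: when an unobserved trait (call it motivation) both pushes students into charter schools and raises their outcomes, a comparison matched only on observables will attribute the motivation effect to the schools themselves.

```python
# Hypothetical simulation: selection on an unobservable biases a matched comparison.
# All parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

motivation = rng.normal(0, 1, n)      # unobserved by the researcher
prior_score = rng.normal(0, 1, n)     # observed and matched on
# Higher (unobserved) motivation makes charter enrollment more likely.
charter = rng.random(n) < 1 / (1 + np.exp(-motivation))
true_effect = 0.0                     # assume charters have zero true effect
outcome = 0.5 * prior_score + 0.5 * motivation + true_effect * charter + rng.normal(0, 1, n)

# Crude "matching" on the observable: compare charter and non-charter students
# within narrow bins of prior score, then take a weighted average of the gaps.
bins = np.digitize(prior_score, np.linspace(-3, 3, 61))
diffs, weights = [], []
for b in np.unique(bins):
    mask = bins == b
    treated, control = outcome[mask & charter], outcome[mask & ~charter]
    if len(treated) and len(control):
        diffs.append(treated.mean() - control.mean())
        weights.append(len(treated))
estimate = np.average(diffs, weights=weights)
print(f"true effect: {true_effect:.2f}, matched estimate: {estimate:.2f}")  # noticeably above zero
```

In this toy setup the matched estimate is positive even though the true effect is zero, because the matching variable (prior score) says nothing about motivation. Whether anything like this happens in a real charter study depends entirely on the context, which is exactly the point.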

People have a very bad habit of declaring that matching or another observational method is just as good as gold-standard research designs whenever the two produce similar results. They did this after Abdulkadiroğlu et al. produced their Boston charter results. But declaring that both methods are just as good ignores why we have gold-standard research in the first place. The bias of observational methods is typically unobserved. And those biases certainly exist some of the time even if they are not present all of the time. Finding similar results for matching methods in one circumstance does not erase this fact.

To their credit, Ackerman and Egalite are careful to emphasize that matching should only be considered when more rigorous approaches are not available. My strong preference is that we should avoid sub-par methodologies, especially when the same policy has been subject to at least some gold-standard evaluations. We don’t need a study on every charter school in every state. We should rely on the rigorous research where we have it and then extrapolate those results to other schools and states. I’d rather be guided by theory supported by rigorous evidence than demand sub-par evidence for all things. Demanding evidence for every school in every state gives us a false sense of confidence that we really know how each state and school is doing.

Unfortunately, in their drive to make “evidence-based” decisions and feel “scientific,” ed reform policymakers and leaders have demanded that evidence be produced for each school in each state. Some have gone so far as to demand evidence on the effectiveness of each teacher. We can’t produce rigorous evidence all of the time, so these demands are driving us toward lower-quality research designs. Those designs may produce unbiased results some of the time, but they certainly won’t all of the time. So, in the desire to be evidence-based and scientific, we are likely to undermine the quality of evidence and science. Let’s stick to gold-standard work for policy questions where we have those studies.

9 Responses to Matching Method and the Gold Standard

  1. matthewladner says:

    The virtual charter study from CREDO may be an example of just the sort of over-confidence you describe: if enrolling in a virtual charter school is often a substitute for dropping out of school entirely, then no amount of demographic matching would create a reliable comparison group.

  2. George Mitchell says:

    In the criminal justice field, the words “evidence-based” are an omnipresent signal that one dare not question an assertion. To the extent this has not happened in the K-12 world, it surely will. This is due in large part to the failure of reporters to ask “what evidence?” and the failure of the same reporters to grasp basic principles of social science research.

  3. Anna Egalite says:

    Thanks for your input, Jay. Your points are well taken. Sure, we were able to address some of the most salient criticisms of the CREDO model and found that, in this context, they didn’t matter as much as we might have thought. But that’s not to say that a parallel study in a different context would turn out the same. The unobservable biases that an RCT overcomes are, well, unobservable, and an RCT is the only research design that is foolproof enough to ensure these biases don’t influence the results.

    Big picture, though, I think your broader point is to question why we need so many evaluations of charter schools in the first place, and to that I’m sympathetic.

    • George Mitchell says:

      In my experience “more” studies are needed to convince policy makers that there are ways to improve K-12 results. The general failure of quality studies to further that goal is explained in part by the relentless efforts of reform opponents to distort quality research and to produce their own flawed “studies.” Most elected officials don’t have the time or inclination to sort through the work and distinguish quality from garbage. This means the bad guys win, as they understand full well.

      Many respected researchers who do quality work are loath to engage in a serious effort to communicate with policy makers. They want to steer clear of “politics” and let the evidence speak for itself in journals that elected officials rarely see. They should re-evaluate their strategy.

      • Anna Egalite says:

        To your latter point, I agree, George. This is why I really value the role that Education Next plays by serving as an outlet for researchers to communicate their findings in a format that practitioners and policymakers find accessible and interesting.

    • Thanks, Anna. I think the motivation to produce research for each state, district, and school is a political one. Policymakers seem to need to hear the name of their state, district, or school in the study to believe it. In addition, foundation boards seem to want “proof” that each state, district, and school where they spend money is benefiting. It’s as if they are saying, “Sure, polio vaccines have positive results in your rigorous experiment, but do they help kids in New Orleans?”

      • George Mitchell says:

        Jay,

        Apropos of your polio illustration, unless I am mistaken the Milwaukee print media has never, and I mean never, reported on the gold-standard choice studies outside Wisconsin. Those studies obviously pertain to the efficacy of choice, but if they don’t involve Wisconsin programs they are deemed not relevant. And then, of course, the Milwaukee media consistently misreports on the studies that do involve Wisconsin.

  4. Minnesota Kid says:

    There is a logical tension in your argument, Jay. You essentially claim that matching-validation tests, such as the one that Anna and Matthew have done in Florida (and Bob Bifulco has done in North Carolina), lack external validity. Matching is a valid substitute for random assignment in those two states but not necessarily elsewhere. Fair enough. But then you argue that random assignment studies don’t need to meet such a standard for external validity. If a few isolated experiments show something is true, we should expect it to be true everywhere and move on. That’s a difficult double standard to maintain, especially since most education experiments (1) only assess programs when they first launch, (2) often include just a small fraction of all program participants (less than 20 percent in some cases), since many students are not subject to lotteries, and (3) disproportionately involve distinctive populations of students (e.g., urban students, minorities), and so on. If you are concerned about the external validity of the matching-validation studies in Florida and North Carolina, you should be ten times more concerned about the external validity of the small urban school choice experiments that you and others quite sensibly champion (for their internal validity).

    • Actually, my objection to the validation exercise is really one of internal validity. Even if matching and an RCT produce a similar result in the same place, we don’t know that it is because matching is unaffected by unobservable differences. The similarity could be the product of luck, or perhaps different unobservable factors produced biases in opposite directions that canceled each other out. The problem of unobserved differences is inherent in observational methods, and no validation exercise has the internal validity to prove that this is not a problem, even in that one circumstance. Any weak methodology can produce the “right” result even when it lacks internal validity.

      The solution to your concerns about the limits of current experiments is to conduct more rigorous experiments. But we don’t need (and could never produce) a rigorous experiment in every state, district, and school. We just need enough to ensure that we don’t assess programs only when they are created, that we can say something about more than a small fraction of participants, and that we have experiments from a few different circumstances. The wrong response is to generate a high volume of lower-quality studies that may well give us wrong answers, fill us with false confidence that we know things we don’t know, and distract energy and resources from producing the limited number of rigorous experiments we actually need.
