Confusing Evidence and Politics

In education reform debates it is far too common to hear someone say “the evidence shows” when what follows is simply their preferred policy, with no research behind it at all.  People confuse what makes good sense and is good politics with what is actually supported by evidence.  At the Gates Foundation this problem is endemic.  They have repeatedly confused evidence and politics.

I think I can clearly illustrate this confusion of evidence and policy preference at the Gates Foundation with the most recent article by Tom Kane in Education Next summarizing the results of the Measures of Effective Teaching (MET) project.  MET is an ambitious project to record several thousand classroom lessons, survey students, and administer multiple standardized tests in order to identify the best way to measure teacher effectiveness and, eventually, the teaching practices that are associated with greater learning.  The study cost $45 million, on top of the $335 million reported cost of implementing the program in several school districts.

The main claimed finding of MET at this point is that combining classroom observation and student survey scores with student achievement gains is the best way to measure teacher effectiveness.  As Kane writes, “the evidence reveals that…  rather than rely on any single indicator, schools should try to see effective teaching from multiple angles.”  I’m willing to agree with Kane that using multiple measures of teacher effectiveness is supported by political wisdom and sound theory, but the evidence they produced does not demonstrate the merits of multiple measures.

Kane summarizes “the case for multiple measures” in the second to last section of his article.  He states, “First, combining [multiple measures] generates less volatility from course section to section or year to year, and greater predictive power.”  The results don’t support this claim.  As Figure 1 of his article and Table 16 on p. 51 of the MET report show, an equally weighted combination of student achievement gains with classroom observation scores and student survey results actually lowers predictive power.  You are better at predicting teacher value added in a class just by using the teacher value added measure from another class than by combining that achievement gain measure equally with classroom observation scores and student survey responses, which is the opposite of “combining them generates… greater predictive power.”

The only way using multiple measures could have a roughly equal predictive power to achievement gains alone is if they are combined such that achievement gains constitute 75.8% of the combined measure, with only 4.2% of the combined measure coming from classroom observation scores and 20.0% coming from student surveys.

But there are significant difficulties and costs associated with collecting the classroom observation scores.  Every observer had to receive 17 to 25 hours of training and even after that 23% of the observers had to be excluded for lack of reliability.  And then as Kane acknowledges: “Even with trained raters, we had to score four lessons, each by a different observer, and average those scores to get a reliable measure of a teacher’s practice. Given the high opportunity cost of a principal’s time, or the salaries of professional peer observers, classroom observations are the costliest source of feedback.”  All of this was necessary for a measure that constituted 4.2% of a combined measure with about the same predictive power as forgetting about classroom observations and just using achievement gains.

Someone reviewing this evidence who was not already committed to the policy of using multiple measures would obviously conclude that classroom observations were not worth the significant expense and bother.  The conclusion Kane and Gates offer is not driven by “the evidence” but by their preference for a policy that is based on other political and theoretical reasons.

I should note that the increase in reliability from combining measures of teacher effectiveness provides little consolation.  Kane measures reliability as the correlation of the evaluation score from class to class for the same teacher.  You could improve reliability without improving predictive power, or even while hurting it, simply by adding to the combined measure another variable that is stable from class to class for the same teacher but has nothing to do with how much students learn.  The combined measure would be more consistent (reliable), but it would be more consistently wrong.
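The point about reliability can be illustrated with a small simulation.  This is only a sketch under invented assumptions, not a reanalysis of MET data: every number in it is hypothetical, and `stable_trait` stands in for any variable that is the same for a teacher in every class but unrelated to student learning.  It compares value added alone with an equally weighted blend of value added and that irrelevant variable.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def simulate(n_teachers=20000, seed=0):
    """Hypothetical teachers, each observed in two classes (A and B)."""
    rng = random.Random(seed)
    # Assumption: a teacher's true effect is the same in both classes.
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    # Assumption: an irrelevant trait, identical in both classes,
    # drawn independently of the true effect.
    stable_trait = [rng.gauss(0, 1.5) for _ in range(n_teachers)]
    # Measured value added = true effect + class-specific noise.
    va_a = [t + rng.gauss(0, 1) for t in true_effect]
    va_b = [t + rng.gauss(0, 1) for t in true_effect]
    # Equally weighted combination of value added and the irrelevant trait.
    combo_a = [(v + z) / 2 for v, z in zip(va_a, stable_trait)]
    combo_b = [(v + z) / 2 for v, z in zip(va_b, stable_trait)]
    return {
        # Reliability: same score, class A versus class B.
        "reliability_va": pearson(va_a, va_b),
        "reliability_combined": pearson(combo_a, combo_b),
        # Predictive power: class A score versus class B value added.
        "predictive_va": pearson(va_a, va_b),
        "predictive_combined": pearson(combo_a, va_b),
    }


for name, value in simulate().items():
    print(f"{name}: {value:.2f}")
```

Under these made-up settings the combined score is more reliable than value added alone and yet predicts value added in the other class less well, which is exactly the combination described above.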

Kane then offers another argument for combining measures: “A second reason to combine the measures is to reduce the risk of unintended consequences, to lessen the likelihood of manipulation or ‘gaming.’ Whenever one places all the stakes on any single measure, the risk of distortion and abuse goes up.”  This makes very good sense and is a persuasive argument.  The only problem is that it is not in any way derived from “the evidence” produced by their study.  It’s just a sound theoretical argument.  Kane and Gates shouldn’t say “the evidence” supports multiple measures when they aren’t actually relying on evidence to make their claim.  They didn’t need to spend almost $400 million to implement and study MET to make this point.

And finally, Kane suggests that “[t]here is a third reason to collect multiple measures: conflicting messages from the multiple sources of information send a signal to supervisors that they should take a close look at what’s going on in the classroom.”  Again, this is a theoretical argument rather than a conclusion drawn from any evidence the study collected.  And unfortunately, the evidence from the study suggests that a combined measure will send “conflicting messages” almost all of the time.  The correlation between classroom observation or student survey scores and achievement gains was no higher than .13.  With such a low correlation, administrators will very often see the three types of measures disagree about the same teacher.  Kane might as well suggest that supervisors should always take a close look at every teacher.
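To get a rough sense of how often measures correlated at .13 disagree, here is a sketch that rests on assumptions the MET report does not supply: that two measures are jointly normal, and that a “conflicting message” means one measure puts a teacher above the median while the other puts the same teacher below it.  The function names are hypothetical.  The closed form used for the check is the standard result that two standard normal variables with correlation rho share a sign with probability 1/2 + arcsin(rho)/pi.

```python
import random
from math import asin, pi, sqrt


def analytic_disagreement(rho):
    """Share of teachers ranked above the median on one measure and
    below it on the other, assuming bivariate normal scores."""
    return 0.5 - asin(rho) / pi


def simulated_disagreement(rho, n_teachers=100_000, seed=0):
    """Monte Carlo check of the same quantity with invented teachers."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_teachers):
        gains = rng.gauss(0, 1)
        # Build a second score whose correlation with gains is rho.
        other = rho * gains + sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        if (gains > 0) != (other > 0):
            disagreements += 1
    return disagreements / n_teachers


rho = 0.13  # upper bound on the correlation reported in the article
print(f"analytic:  {analytic_disagreement(rho):.3f}")
print(f"simulated: {simulated_disagreement(rho):.3f}")
```

Under these assumptions, a single pair of measures lands on opposite sides of the median for a little under half of all teachers, barely better than a coin flip.  With three measures in play, the share of teachers for whom at least one pair disagrees can only be higher.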

I’m inclined to agree with Kane and Gates that it is better to use multiple measures when evaluating teacher effectiveness.  I just don’t see how “the evidence” does anything to support this view.  The argument for multiple measures is largely theoretical and political.  Theory suggests that a single measure is more subject to manipulation and unwanted distortion in teacher behavior.  And politics suggests that teachers will be more resistant to any system that is based solely on test scores.  These are all fine reasons for supporting multiple measures, but we shouldn’t debase the currency of research by falsely claiming that they are supported by the evidence when the evidence shows no such thing.  Kane and Gates are confusing evidence with policy preferences that are justified by considerations that have nothing to do with research.  The truth is that MET was a very expensive effort that failed to produce the evidence they wanted, but in Orwellian fashion they declare victory and still say it supports their pre-determined conclusion.

Unfortunately, there is a pattern at Gates of abusing “evidence” and “research” in this way to support preferred policies.  I’ve written a number of posts in the past detailing this problem at Gates.  In addition to spinning the multiple measures claim, I’ve pointed out that they falsely claimed that their student survey results showed that “[t]eaching to the test makes your students do worse on the tests.”  I noted that Gates was again indifferent to evidence when they abandoned their small schools strategy without waiting for the results of a random-assignment evaluation that ultimately showed that small schools were effective.  And Gates has backed the push for Common Core standards with phony science.

Unlike other critics of the Gates Foundation, I am not motivated by the belief that it is illegitimate for billionaires to use their wealth to try to advance education reform.  On the contrary, I’ve focused on Gates because I believe they are squandering their great potential to have a positive impact.  I’d like them to do better.

But even more importantly, I’ve harped on these abuses of “evidence” and “research” to advance the Gates policy agenda because I fear that Gates is undermining the use of real evidence and research by others to positively influence policy.  I understand that people and organizations can favor policies without having the evidence to prove their merit.  But I cannot understand or accept abusing the idea of evidence and research to advance preferred policies.  Doing so ruins the use of evidence for everyone by feeding the cynical belief that all research is just a way to manipulate others to get what you want.  The more that the general credibility of research and evidence is damaged, the more that policy outcomes will be determined by the brute power of involved interests, which means that the unions are more likely to prevail.  A belief in research and evidence is the only way for weaker interests to triumph, so it is essential that the ed reform movement not debase its own currency.

I still have hope that Gates can right their ship.  It won’t be easy, but they can take important steps to change their organizational culture and structure so that they do not repeatedly abuse claims of research and evidence for their policy preferences.  In the next post, I’ll explain what they should do to reform themselves.

(Edited to correct typo)

