
If I were running a school I’d probably want to evaluate teachers using a mixture of student test score gains, classroom observations, and feedback from parents, students, and other staff. But I recognize that different schools have different missions and styles that can best be assessed using different methods. I wouldn’t want to impose on all schools in a state or the nation a single, mechanistic system for evaluating teachers since that is likely to be a one size fits none solution. There is no single best way to evaluate teachers, just like there is no single best way to educate students.
But the folks at the Gates Foundation, afflicted with PLDD, don’t see things this way. They’ve been working with politicians in Illinois, Los Angeles, and elsewhere to centrally impose teacher evaluation systems, but they’ve encountered stiff resistance. In particular, they’ve noticed that teachers and others have expressed strong reservations about any evaluation system that relies too heavily on student test scores.
So the folks at Gates have been trying to scientifically validate a teacher evaluation system that involves a mix of test score gains, classroom observations, and student surveys so that they can overcome resistance to centrally imposed, mechanistic evaluation systems. If they can reduce reliance on test scores in that system while still carrying the endorsement of “science,” the Gates folks imagine that politicians, educators, and others will all embrace the Gates central planning fantasy.
Let’s leave aside for the moment the political reality, demonstrated recently in Chicago and Los Angeles, that teachers are likely to fiercely resist any centrally imposed, mechanistic evaluation system regardless of the extent to which it relies on test scores. The Gates folks want to put on their lab coats and throw the authority of science behind a particular approach to teacher evaluation. If you oppose it you might as well deny global warming. Science has spoken.
So it is no accident that the release of the third and final round of reports from the Gates Foundation’s Measures of Effective Teaching project was greeted with the following headline in the Washington Post: “Gates Foundation study: We’ve figured out what makes a good teacher,” or this similarly humble claim in the Denver Post: “Denver schools, Gates foundation identify what makes effective teacher.” This is the reaction that the Gates Foundation was going for: we’ve used science to discover the correct formula for evaluating teachers. And by implication, we now know how to train and improve teachers by using the scientifically validated methods of teaching.
The only problem is that things didn’t work out as the Gates folks had planned. Classroom observations make virtually no independent contribution to the predictive power of a teacher evaluation system. You have to dig to find this, but it’s right there in Table 1 on page 10 of one of the technical reports released yesterday. In a regression to predict student test score gains using out-of-sample test score gains for the same teacher, student survey results, and classroom observations, there is virtually no relationship between test score gains and either classroom observations or student survey results. In only 3 of the 8 models presented is there any statistically significant relationship between either classroom observations or student surveys and test score gains (I’m excluding the 2 instances where they report p < .1 as statistically significant). And in all 8 models the point estimates suggest that a standard deviation improvement in classroom observation or student survey results is associated with less than a .1 standard deviation increase in test score gains.
Not surprisingly, a composite teacher evaluation measure that mixes classroom observations and student survey results with test score gains is generally no better, and sometimes much worse, at predicting out-of-sample test score gains than test score gains alone. The Gates folks trumpet the finding that the combined measures are more “reliable,” but that only means that they are less variable, not any more predictive.
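To see how a measure can be more “reliable” without being more predictive, here is a toy simulation. It is only a sketch under invented assumptions, not a reanalysis of the Gates data: every number in it (the noise levels, the 50/50 weights, the idea that observation scores track a stable trait unrelated to test score gains) is hypothetical and chosen to make the point as starkly as possible.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate(n_teachers=20000, seed=0):
    """Hypothetical teachers measured in two separate years."""
    rng = random.Random(seed)
    gains_1, gains_2, comp_1, comp_2 = [], [], [], []
    for _ in range(n_teachers):
        skill = rng.gauss(0, 1)  # drives test score gains (assumed)
        style = rng.gauss(0, 1)  # stable trait observers pick up (assumed unrelated to skill)
        g1 = skill + rng.gauss(0, 1)    # noisy test score gain, year 1
        g2 = skill + rng.gauss(0, 1)    # noisy test score gain, year 2
        o1 = style + rng.gauss(0, 0.2)  # observation score, year 1 (low noise)
        o2 = style + rng.gauss(0, 0.2)  # observation score, year 2
        gains_1.append(g1)
        gains_2.append(g2)
        comp_1.append(0.5 * g1 + 0.5 * o1)  # 50/50 composite, year 1
        comp_2.append(0.5 * g2 + 0.5 * o2)  # 50/50 composite, year 2
    return {
        # Reliability: does the measure agree with itself across years?
        "gains_reliability": pearson(gains_1, gains_2),
        "composite_reliability": pearson(comp_1, comp_2),
        # Prediction: does the year-1 measure predict year-2 test score gains?
        # For the gains-only measure this is the same correlation as its reliability.
        "gains_prediction": pearson(gains_1, gains_2),
        "composite_prediction": pearson(comp_1, gains_2),
    }


results = simulate()
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

In this made-up world the composite is steadier from year to year than test score gains alone, yet it does a worse job of predicting the next year’s gains. Real observation scores need not be this disconnected from test score gains, but the sketch shows why “more reliable” and “more predictive” are different claims.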
But “the best mix” according to the “policy and practitioner brief” is “a composite with weights between 33 percent and 50 percent assigned to state test scores.” How do they know this is the “best mix?” It generally isn’t any better at predicting test score gains. And collecting the classroom observations involves an enormous expense and hassle. To get the measure as “reliable” as they did without sacrificing too much predictive power, the Gates team had to have each teacher observed at least four different times by at least two different coders, including one coder from outside the school. To observe 3.2 million public school teachers for four hours each by staff compensated at $40 per hour would cost more than $500 million each year (3.2 million × 4 × $40 = $512 million). The Gates people also had to train the observers for at least 17 hours and even after that had to throw out almost a quarter of those observers as unreliable. To do all of this might cost about $1 billion each year.
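The back-of-the-envelope arithmetic behind the observation cost can be sketched in a few lines of Python. The teacher count, hours, hourly rate, and 17 training hours come from the paragraph above; the function names are made up for illustration, and the 0.25 discard rate is my rounding of “almost a quarter.” The sketch does not try to reproduce the rough $1 billion total, since the post does not itemize it.

```python
def observation_cost(teachers, hours_per_teacher, hourly_rate):
    """Annual cost of the observation hours alone, in dollars."""
    return teachers * hours_per_teacher * hourly_rate


def training_hours_per_retained_observer(training_hours, discard_rate):
    """Training hours paid for each observer who survives the reliability cut."""
    return training_hours / (1 - discard_rate)


TEACHERS = 3_200_000      # public school teachers (figure from the post)
HOURS_PER_TEACHER = 4     # at least four observations, taken as one hour each
HOURLY_RATE = 40          # dollars per observer hour (figure from the post)
TRAINING_HOURS = 17       # minimum observer training (figure from the post)
DISCARD_RATE = 0.25       # assumption: "almost a quarter" rounded up

annual_observation_cost = observation_cost(TEACHERS, HOURS_PER_TEACHER, HOURLY_RATE)
print(f"Observation hours alone: ${annual_observation_cost:,}")
# prints: Observation hours alone: $512,000,000

effective_training = training_hours_per_retained_observer(TRAINING_HOURS, DISCARD_RATE)
print(f"Training hours per usable observer: {effective_training:.1f}")
```

Because roughly a quarter of trained observers get thrown out, every usable observer effectively costs more than the nominal 17 hours of training, and that comes on top of the observation hours themselves.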
And what would we get for this billion? Well, we might get more consistent teacher evaluation scores, but we’d get basically no improvement in the identification of effective teachers. And that’s the “best mix?” Best for what? It’s best for the political packaging of a centrally imposed, mechanistic teacher evaluation system, which is what this is all really about. Vicki Phillips, who heads the Gates education efforts, captured in this comment what I think they are really going for with a composite evaluation score:
Combining all three measures into a properly weighted index, however, produced a result “teachers can trust,” said Vicki Phillips, a director in the education program at the Gates Foundation.
It’ll cost a fortune, it doesn’t improve the identification of effective teachers, but we need to do it to overcome resistance from teachers and others. Not only will this not work, but in spinning the research as they have, the Gates Foundation is clearly distorting the straightforward interpretation of their findings: a mechanistic system of classroom observation provides virtually nothing for its enormous cost and hassle. Oh, and this is the case when no stakes were attached to the classroom observations. Once we attach all of this to pay or continued employment, their classroom observation system will only get worse.
I should add that if classroom observations aren’t useful as predictors, they also can’t be used effectively for diagnostic purposes. An earlier promise of this project was that they would figure out which teacher evaluation rubrics were best and which sub-components of those rubrics were most predictive of effective teaching. But that clearly hasn’t panned out. In the new reports I can’t find anything about the diagnostic potential of classroom observations, which is not surprising since those observations are not predictive.
So, rather than having “figured out what makes a good teacher” the Gates Foundation has learned very little in this project about effective teaching practices. The project was an expensive flop. Let’s not compound the error by adopting this expensive flop as the basis for centrally imposed, mechanistic teacher evaluation systems nationwide.
(Edited for typos and to add links. To see a follow-up post, click here.)