Over the weekend I posted about how I thought the Gates Foundation was spinning the results of their Measures of Effective Teaching Project to suggest that the combination of student achievement gains, student surveys, and classroom observations was the best way to build a predictive measure of teacher effectiveness. Let me anticipate some of the responses they may have:
1) They might say that they clearly admit the limitations of classroom observations and therefore are not guilty of spinning the results to inflate their importance. They could point to p. 15 of the research paper in which they write: “When value-added data are available, classroom observations add little to the ability to predict value-added gains with other groups of students. Moreover, classroom observations are less reliable than student feedback, unless many different observations are added together.”
Response: I said in my post over the weekend that the Gates folks were careful so that nothing in the reports is technically incorrect. The distortion of their findings comes from the emphasis and manner of presentation. For example, the summary of findings in the research paper on p. 9 states: “Combining observation scores with evidence of student achievement gains and student feedback improved predictive power and reliability.” Or the “key findings” in the practitioner brief on p. 5 say: “Observations alone, even when scores from multiple observations were averaged together, were not as reliable or predictive of a teacher’s student achievement gains with another group of students as a measure that combined observations with student feedback and achievement gains on state tests.” Notice that these summaries of the results fail to mention the most straightforward and obvious finding: classroom observations are really expensive and cumbersome and yet do almost nothing to improve the predictiveness of student achievement-based measures of teacher quality.
And the proof that the results are being spun is that the media coverage uniformly repeats the incorrect claim that multiple measures are an important improvement on test scores alone. Either all of the reporters are lousy and don’t understand the reports or the reporters are accurately repeating what they are being told and what they overwhelmingly see in the reports. My money is on the latter explanation.
And further proof that the reporters are being spun is that Vicki Phillips, the Gates education chief, is quoted in the LA Times coverage mis-characterizing the findings: “Using these methods to evaluate teachers is ‘more predictive and powerful in combination than anything we have used as a proxy in the past,’ said Vicki Phillips, who directs the Gates project.” This is just wrong. As I pointed out in my previous post, the combined measure is no more predictive than student achievement by itself.
Lastly, the standard for fair and accurate reporting of results is not whether one could find any way to show that the description of findings is technically not false. We should expect the most straightforward and obvious description of the findings to be emphasized. With the Gates folks I feel like I am repeatedly parsing what the meaning of the word “is” is. That’s political spin, not research.
2) They might say that classroom observations are an important addition because at least they provide diagnostic information about how teachers can improve, while test scores cannot.
Response: This may be true, but it is not a claim supported by the Gates study. They found that all of the different classroom observation methods they tried had very weak predictive power. You can’t provide a lot of feedback about how to improve student achievement based on instruments that are barely correlated with gains in student achievement. In addition, they were unable to find sub-components of the classroom observation methods that were more predictive, so they can’t tell teachers that they really need to do certain things on the grounds that those things are more strongly related to student learning gains. Lastly, it is simply untrue that test scores cannot be diagnostic. There are sub-components of the tests that measure learning in different aspects of the subject. Teachers could be told to put more emphasis on those areas in which their students have lagged.
3) They may say that classroom observations and student surveys improve the reliability of a teacher quality measure when combined with test scores.
Response: An increase in reliability is cold comfort for a lack of predictive power. Reliability is just an indicator of how consistent a measure is. There are plenty of measures that are very consistent but not helpful in predicting teacher quality. For example, if we asked students to rate how attractive their teacher was, we would probably get a very “reliable” (consistent) measure from year to year and section to section. But that consistency would not make up for the fact that attractiveness is unlikely to help improve the prediction of effective teaching. So, the student survey has a high amount of consistency, but who knows what it is really measuring, since it is only weakly related to student learning gains. It is consistent, but consistently wrong. Our focus should be on the predictive power of teacher evaluations, and classroom observations and student surveys don’t really do anything to help with that (at least, not according to the Gates study).
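To make the difference between reliability and predictive power concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the number of teachers, the noise levels, and the idea that the rated trait is completely unrelated to teaching effectiveness. None of these numbers come from the Gates study. It only shows that a measure can correlate highly with itself from year to year while telling us almost nothing about next year's achievement gains.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_teachers=2000, seed=0):
    """Hypothetical data: all distributions and noise levels are invented."""
    rng = random.Random(seed)

    # Each teacher's true effect on learning, and an unrelated stable trait
    # (for example, how attractive students find the teacher).
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    trait = [rng.gauss(0, 1) for _ in range(n_teachers)]

    # Student ratings of the trait: very little noise, so very consistent.
    rating_y1 = [a + rng.gauss(0, 0.3) for a in trait]
    rating_y2 = [a + rng.gauss(0, 0.3) for a in trait]

    # Test-score gains: driven by the true effect, but measured noisily.
    gains_y1 = [t + rng.gauss(0, 1.0) for t in true_effect]
    gains_y2 = [t + rng.gauss(0, 1.0) for t in true_effect]

    return {
        # Reliability: does the measure agree with itself across years?
        "rating_reliability": pearson(rating_y1, rating_y2),
        "gains_reliability": pearson(gains_y1, gains_y2),
        # Predictive power: does this year's rating predict next year's gains?
        "rating_predictive": pearson(rating_y1, gains_y2),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

Under these made-up assumptions the rating comes out far more reliable than the noisy gains measure, yet its correlation with next year's gains hovers near zero. That is the "consistent, but consistently wrong" pattern in miniature.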
4) They may say that classroom observations and student surveys improve on the prediction of student effort and classroom environment.
Response: As I mentioned in the post over the weekend, they don’t really have validated measures of student effort and classroom environment. The Gates folks took a lot of flak last year for focusing on test-score gains, so they came up with some non-test-score outcome measures simply by taking some of the items from the student survey where students are asked about their effort or classroom environment. We have no idea whether they have really measured the amount of effort students exert or the quality of the classroom environment; they are just using some survey answers on those items and claiming that they have measured those “outcomes.” The only validated outcome measure we have in the Gates study is the test score gains, so we have to focus on that.
—————————————————————————————————
The good news is that my fears about the Gates study being used to dictate what teachers do have not been realized, at least not yet. But it wasn’t for lack of trying. If the classroom observations had worked a little better in predicting student learning gains, I’m sure we would have heard about how teachers should run their classrooms to produce greater gains. But the classroom observations were so much of a dud that the Gates education chief, Vicki Phillips, didn’t even attempt to claim that they found that drill and kill is bad or that teachers should avoid teaching to the test.
But the inability to use the classroom observations to tell teachers the “right” way of teaching is another way of saying that the classroom observations are not able to be used for diagnostic purposes. The most straightforward reading of the Gates results is that classroom observations appear to be an expensive and ineffective dud. But it’s hard for an organization that spends $45 million on a project to scientifically validate classroom observations to admit that it failed. It’s hard enough for a third-party evaluator to say that, let alone an in-house study about a key aspect of the Gates policy agenda.
Posted by Jay P. Greene 