January | 2013 | Jay P. Greene's Blog

What Success Would Have Looked Like

January 10, 2013

Yesterday I described the Gates Foundation’s Measuring Effective Teachers (MET) project as “an expensive flop.” To grasp just what a flop the project was, it’s important to consider what success would have looked like. If the project had produced what Gates was hoping, it would have found that classroom observations were strong, independent predictors of other measures of effective teaching, like student test score gains. Even better, they were hoping that the combination of classroom observations, student surveys, and previous test score gains would be a much better predictor of future test score gains (or of future classroom observations) than any one of those measures alone. Unfortunately, MET failed to find anything like this.

If MET had found classroom observations to be strong predictors of other indicators of effective teaching and if the combination of measures were a significantly better predictor than any one measure alone, then Gates could have offered evidence for the merits of a particular mixing formula or range of mixing formulas for evaluating teachers. That evidence could have been used to good effect to shape teacher evaluation systems in Chicago, LA, and everywhere else.

They also could have genuinely reassured teachers anxious about the use of test score gains in teacher evaluations. MET could have allayed those concerns by telling teachers that test score gains produce information that is generally similar to what is learned from well-conducted classroom observations, so there is no reason to oppose one and support the other. What’s more, significantly improved predictive power from a mixture of classroom observations with test score gains could have made the case for why we need both.

MET was also supposed to have helped us adjudicate among several commonly used rubrics for classroom observations so that we would have solid evidence for preferring one approach over another. Because MET found that classroom observations in general are barely related to other indicators of teacher effectiveness, the study told us almost nothing about the criteria we should use in classroom observations.

In addition, the classroom observation study was supposed to help us identify the essential components of effective teaching . That knowledge could have informed improved teacher training and professional development. But because MET was a flop (because classroom observations barely correlate with other indicators of teacher effectiveness and fail to improve the predictive power of a combined measure), we haven’t learned much of anything about the practices that are associated with effective teaching. If we can’t connect classroom observations with effective teaching in general, we certainly can’t say much about the particular aspects of teaching that were observed that most contributed to effective teaching.

Just so you know that I’m not falsely attributing to MET these goals that failed to be realized, look at this interview from 2011 of Bill Gates by Jason Riley in the Wall Street Journal. You’ll clearly see that Bill Gates was hoping that MET would do what I described above. It failed to do so. Here is what the interview revealed about the goals of MET:

Of late, the foundation has been working on a personnel system that can reliably measure teacher effectiveness. Teachers have long been shown to influence students’ education more than any other school factor, including class size and per-pupil spending. So the objective is to determine scientifically what a good instructor does.

“We all know that there are these exemplars who can take the toughest students, and they’ll teach them two-and-a-half years of math in a single year,” he says. “Well, I’m enough of a scientist to want to say, ‘What is it about a great teacher? Is it their ability to calm down the classroom or to make the subject interesting? Do they give good problems and understand confusion? Are they good with kids who are behind? Are they good with kids who are ahead?’

“I watched the movies. I saw ‘To Sir, With Love,'” he chuckles, recounting the 1967 classic in which Sidney Poitier plays an idealistic teacher who wins over students at a roughhouse London school. “But they didn’t really explain what he was doing right. I can’t create a personnel system where I say, ‘Go watch this movie and be like him.'”

Instead, the Gates Foundation’s five-year, $335-million project examines whether aspects of effective teaching—classroom management, clear objectives, diagnosing and correcting common student errors—can be systematically measured. The effort involves collecting and studying videos of more than 13,000 lessons taught by 3,000 elementary school teachers in seven urban school districts.

“We’re taking these tapes and we’re looking at how quickly a class gets focused on the subject, how engaged the kids are, who’s wiggling their feet, who’s looking away,” says Mr. Gates. The researchers are also asking students what works in the classroom and trying to determine the usefulness of their feedback.

Mr. Gates hopes that the project earns buy-in from teachers, which he describes as key to long-term reform. “Our dream is that in the sample districts, a high percentage of the teachers determine that this made them better at their jobs.” He’s aware, though, that he’ll have a tough sell with teachers unions, which give lip service to more-stringent teacher evaluations but prefer existing pay and promotion schemes based on seniority—even though they often end up matching the least experienced teachers with the most challenging students.

The final MET reports produced virtually nothing that addressed these stated goals. But in Orwellian fashion, the Gates folks have declared the project to be a great success. I never expected MET to work because I suspect that effective teaching is too heterogeneous to be captured well by a single formula. There is no recipe for effective teaching because kids and their needs are too varied, teachers and their abilities are too varied, and the proper matching of student needs and teacher abilities can be accomplished in many different ways. But this is just my suspicion. I can’t blame the Gates Foundation for trying to discover the secret sauce of effective teaching, but I can blame them for refusing to admit that they failed to find it. Even worse, I blame them for distorting, exaggerating, and spinning what they did find.

(edited for typos)

13 Comments | research reports | Tagged: gates foundation, Measuring Effective Teachers | Permalink
Posted by Jay P. Greene

Expulsion Rates in DC

January 10, 2013

(Guest Post by Matthew Ladner)

The Washington Post has an important story up about expulsion rates in DC district and charter schools. I can’t figure out how to embed anything but Youtube videos so the link is here.

Go watch it.

I’ll be here when you get back.

Go on…

Ok good. One important item to note: if we were to go and look up the criminal incident reports we would quickly conclude that the expulsion rate in DCPS is far too low. If I wanted to be cruel, I’d go dig up the crime data. The video specifies that DCPS expelled three students last year, while the charter schools expelled 200.

It seems self-evident to me that 3 was far too low, and it is difficult to know whether 200 is “too many” for the charter sector without a great deal more context. A district where you have to make the FBI Most Wanted List before getting expelled is not a proper baseline for comparison.

Discuss amongst yourselves…

7 Comments | charter schools | Tagged: Charter School Expulsion rates, DC Charter Schools | Permalink
Posted by matthewladner

Understanding the Gates Foundation’s Measuring Effective Teachers Project

January 9, 2013

If I were running a school I’d probably want to evaluate teachers using a mixture of student test score gains, classroom observations, and feedback from parents, students, and other staff. But I recognize that different schools have different missions and styles that can best be assessed using different methods. I wouldn’t want to impose on all schools in a state or the nation a single, mechanistic system for evaluating teachers since that is likely to be a one size fits none solution. There is no single best way to evaluate teachers, just like there is no single best way to educate students.

But the folks at the Gates Foundation, afflicted with PLDD, don’t see things this way. They’ve been working with politicians in Illinois, Los Angeles, and elsewhere to centrally impose teacher evaluation systems, but they’ve encountered stiff resistance. In particular, they’ve noticed that teachers and others have expressed strong reservations about any evaluation system that relies too heavily on student test scores.

So the folks at Gates have been trying to scientifically validate a teacher evaluation system that involves a mix of test score gains, classroom observations, and student surveys so that they can overcome resistance to centrally imposed, mechanistic evaluation systems. If they can reduce reliance on test scores in that system while still carrying the endorsement of “science,” the Gates folk imagine that politicians, educators, and others will all embrace the Gates central planning fantasy.

Let’s leave aside for the moment the political reality, demonstrated recently in Chicago and Los Angeles, that teachers are likely to fiercely resist any centrally imposed, mechanistic evaluation system regardless of the extent to which it relies on test scores. The Gates folks want to put on their lab coats and throw the authority of science behind a particular approach to teacher evaluation. If you oppose it you might as well deny global warming. Science has spoken.

So it is no accident that the release of the third and final round of reports from the Gates Foundation’s Measuring Effective Teachers project was greeted with the following headline in the Washington Post: “Gates Foundation study: We’ve figured out what makes a good teacher,” or this similarly humble claim in the Denver Post: “Denver schools, Gates foundation identify what makes effective teacher.” This is the reaction that the Gates Foundation was going for — we’ve used science to discover the correct formula for evaluating teachers. And by implication, we now know how to train and improve teachers by using the scientifically validated methods of teaching.

The only problem is that things didn’t work out as the Gates folks had planned. Classroom observations make virtually no independent contribution to the predictive power of a teacher evaluation system. You have to dig to find this, but it’s right there in Table 1 on page 10 of one of the technical reports released yesterday. In a regression to predict student test score gains using out of sample test score gains for the same teacher, student survey results, and classroom observations, there is virtually no relationship between test score gains and either classroom observations or student survey results. In only 3 of the 8 models presented is there any statistically significant relationship between either classroom observations or student surveys and test score gains (I’m excluding the 2 instances were they report p < .1 as statistically significant). And in all 8 models the point estimates suggest that a standard deviation improvement in classroom observation or student survey results is associated with less than a .1 standard deviation increase in test score gains.

Not surprisingly, a composite teacher evaluation measure that mixes classroom observations and student survey results with test score gains is generally no better and sometimes much worse at predicting out of sample test score gains. The Gates folks trumpet the finding that the combined measures are more “reliable” but that only means that they are less variable, not any more predictive.

But “the best mix” according to the “policy and practitioner brief” is “a composite with weights between 33 percent and 50 percent assigned to state test scores.” How do they know this is the “best mix?” It generally isn’t any better at predicting test score gains. And to collect the classroom observations involves an enormous expense and hassle. To get the measure as “reliable” as they did without sacrificing too much predictive power, the Gates team had to observe each teacher at least four different times by at least two different coders, including one coder outside of the school. To observe 3.2 million public school teachers for four hours by staff compensated at $40 per hour would cost more than $500 million each year. The Gates people also had to train the observers at least 17 hours and even after that had to throw out almost a quarter of those observers as unreliable. To do all of this might cost about $1 billion each year.

And what would we get for this billion? Well, we might get more consistent teacher evaluation scores, but we’d get basically no improvement in the identification of effective teachers. And that’s the “best mix?” Best for what? It’s best for the political packaging of a centrally imposed, mechanistic teacher evaluation system, which is what this is all really about. Vicki Phillips, who heads the Gates education efforts, captured in this comment what I think they are really going for with a composite evaluation score:

Combining all three measures into a properly weighted index, however, produced a result “teachers can trust,” said Vicki Phillips, a director in the education program at the Gates Foundation.

It’ll cost a fortune, it doesn’t improve the identification of effective teachers, but we need to do it to overcome resistance from teachers and others. Not only will this not work, but in spinning the research as they have, the Gates Foundation is clearly distorting the straightforward interpretation of their findings: a mechanistic system of classroom observation provides virtually nothing for its enormous cost and hassle. Oh, and this is the case when no stakes were attached to the classroom observations. Once we attach all of this to pay or continued employment, their classroom observation system will only get worse.

I should add that if classroom observations aren’t useful as predictors, they also can’t be used effectively for diagnostic purposes. An earlier promise of this project is that they would figure out which teacher evaluation rubrics were best and which sub-components of those rubrics that were most predictive of effective teaching. But that clearly hasn’t panned out. In the new reports I can’t find anything about the diagnostic potential of classroom observations, which is not surprising since those observations are not predictive.

So, rather than having “figured out what makes a good teacher” the Gates Foundation has learned very little in this project about effective teaching practices. The project was an expensive flop. Let’s not compound the error by adopting this expensive flop as the basis for centrally imposed, mechanistic teacher evaluation systems nationwide.

(Edited for typos and to add links. To see a follow-up post, click here.)