Understanding the Gates Foundation’s Measuring Effective Teachers Project

If I were running a school I’d probably want to evaluate teachers using a mixture of student test score gains, classroom observations, and feedback from parents, students, and other staff.  But I recognize that different schools have different missions and styles that can best be assessed using different methods.  I wouldn’t want to impose on all schools in a state or the nation a single, mechanistic system for evaluating teachers since that is likely to be a one size fits none solution.  There is no single best way to evaluate teachers, just like there is no single best way to educate students.

But the folks at the Gates Foundation, afflicted with PLDD, don’t see things this way.  They’ve been working with politicians in Illinois, Los Angeles, and elsewhere to centrally impose teacher evaluation systems, but they’ve encountered stiff resistance.  In particular, they’ve noticed that teachers and others have expressed strong reservations about any evaluation system that relies too heavily on student test scores.

So the folks at Gates have been trying to scientifically validate a teacher evaluation system that involves a mix of test score gains, classroom observations, and student surveys so that they can overcome resistance to centrally imposed, mechanistic evaluation systems.  If they can reduce reliance on test scores in that system while still carrying the endorsement of “science,” the Gates folks imagine that politicians, educators, and others will all embrace the Gates central planning fantasy.

Let’s leave aside for the moment the political reality, demonstrated recently in Chicago and Los Angeles, that teachers are likely to fiercely resist any centrally imposed, mechanistic evaluation system regardless of the extent to which it relies on test scores.  The Gates folks want to put on their lab coats and throw the authority of science behind a particular approach to teacher evaluation.  If you oppose it you might as well deny global warming.  Science has spoken.

So it is no accident that the release of the third and final round of reports from the Gates Foundation’s Measuring Effective Teachers project was greeted with the following headline in the Washington Post: “Gates Foundation study: We’ve figured out what makes a good teacher,”  or this similarly humble claim in the Denver Post: “Denver schools, Gates foundation identify what makes effective teacher.”  This is the reaction that the Gates Foundation was going for — we’ve used science to discover the correct formula for evaluating teachers.  And by implication, we now know how to train and improve teachers by using the scientifically validated methods of teaching.

The only problem is that things didn’t work out as the Gates folks had planned.  Classroom observations make virtually no independent contribution to the predictive power of a teacher evaluation system.  You have to dig to find this, but it’s right there in Table 1 on page 10 of one of the technical reports released yesterday.  In a regression to predict student test score gains using out-of-sample test score gains for the same teacher, student survey results, and classroom observations, there is virtually no relationship between test score gains and either classroom observations or student survey results.  In only 3 of the 8 models presented is there any statistically significant relationship between either classroom observations or student surveys and test score gains (I’m excluding the 2 instances where they report p < .1 as statistically significant).  And in all 8 models the point estimates suggest that a one standard deviation improvement in classroom observation or student survey results is associated with less than a .1 standard deviation increase in test score gains.

Not surprisingly, a composite teacher evaluation measure that mixes classroom observations and student survey results with test score gains is generally no better, and sometimes much worse, at predicting out-of-sample test score gains than test score gains alone.  The Gates folks trumpet the finding that the combined measures are more “reliable” but that only means that they are less variable, not any more predictive.

But “the best mix” according to the “policy and practitioner brief” is “a composite with weights between 33 percent and 50 percent assigned to state test scores.”  How do they know this is the “best mix?”  It generally isn’t any better at predicting test score gains.  And collecting the classroom observations involves an enormous expense and hassle.  To get the measure as “reliable” as they did without sacrificing too much predictive power, the Gates team had to have each teacher observed at least four different times by at least two different coders, including one coder outside of the school.  To observe 3.2 million public school teachers for four hours by staff compensated at $40 per hour would cost more than $500 million each year.  The Gates people also had to train the observers for at least 17 hours and even after that had to throw out almost a quarter of those observers as unreliable.  To do all of this might cost about $1 billion each year.
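For anyone who wants to check the $500 million figure, here is a minimal back-of-the-envelope sketch. It uses only the three numbers given above and assumes that each of the four observations lasts one hour; the function name is hypothetical and nothing here comes from the Gates reports themselves.

```python
# Back-of-the-envelope check of the observation cost quoted in the post.
# All inputs come from the post; the one-hour-per-observation reading
# of "four hours" is an assumption.
TEACHERS = 3_200_000      # public school teachers
HOURS_PER_TEACHER = 4     # four observations, assumed one hour each
HOURLY_RATE_USD = 40      # observer compensation per hour


def annual_observation_cost(teachers: int, hours: int, rate: int) -> int:
    """Return the yearly cost in dollars of observing every teacher."""
    return teachers * hours * rate


cost = annual_observation_cost(TEACHERS, HOURS_PER_TEACHER, HOURLY_RATE_USD)
print(f"${cost:,} per year")  # $512,000,000 per year
```

This covers observer time only. It leaves out the 17 hours of training per observer and the observers who were discarded as unreliable, which is why the full estimate above is closer to $1 billion.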

And what would we get for this billion?  Well, we might get more consistent teacher evaluation scores, but we’d get basically no improvement in the identification of effective teachers.  And that’s the “best mix?”  Best for what?  It’s best for the political packaging of a centrally imposed, mechanistic teacher evaluation system, which is what this is all really about.  Vicki Phillips, who heads the Gates education efforts, captured in this comment what I think they are really going for with a composite evaluation score:

Combining all three measures into a properly weighted index, however, produced a result “teachers can trust,” said Vicki Phillips, a director in the education program at the Gates Foundation.

It’ll cost a fortune, it doesn’t improve the identification of effective teachers, but we need to do it to overcome resistance from teachers and others.  Not only will this not work, but in spinning the research as they have, the Gates Foundation is clearly distorting the straightforward interpretation of their findings: a mechanistic system of classroom observation provides virtually nothing for its enormous cost and hassle.  Oh, and this is the case when no stakes were attached to the classroom observations.  Once we attach all of this to pay or continued employment, their classroom observation system will only get worse.

I should add that if classroom observations aren’t useful as predictors, they also can’t be used effectively for diagnostic purposes.  An earlier promise of this project was that they would figure out which teacher evaluation rubrics were best and which sub-components of those rubrics were most predictive of effective teaching.  But that clearly hasn’t panned out.  In the new reports I can’t find anything about the diagnostic potential of classroom observations, which is not surprising since those observations are not predictive.

So, rather than having “figured out what makes a good teacher” the Gates Foundation has learned very little in this project about effective teaching practices.  The project was an expensive flop.  Let’s not compound the error by adopting this expensive flop as the basis for centrally imposed, mechanistic teacher evaluation systems nationwide.

(Edited for typos and to add links.  To see a follow-up post, click here.)


19 Responses to Understanding the Gates Foundation’s Measuring Effective Teachers Project

  1. Bravo Jay!!!

    For the Bill and Melinda Gates Foundation:
    Why all the focus on evaluating teachers rather than a major focus on improving instruction?

    As you seem to say —
    If it ain’t centrally imposed top down one size fits none, there is no interest in it.

    John Hattie “Visible Learning for Teachers” makes it very clear that to improve a school and improve instruction requires giving each school’s teachers the power to improve their school. This means teachers establish what needs to change and how to change it. This involves using relevant data to monitor and adjust the components of their improvement plan over time.

    Currently in Education, decision making at the administrative levels is dominated by nonsense.

    Here is some science for BMGF:
    To improve a system requires the intelligent application of relevant data. This project was an expensive flop.

  2. […] At his blog, Dr. Greene fleshes out his objections and points to evidence from the Gates project’s own research to argue that “[c]lassroom observations make virtually no independent contribution to the predictive power o… […]

  3. In the NEA/AFT/AFSCME cartel’s schools (the “public” schools) insiders will bend any system of teacher evaluation to their purposes. You can bet that teacher unions eventually will select classroom evaluators and these evaluators will use vague criteria to reward friends and punish enemies. You will see the old “I’m brave, you’re reckless, he’s foolhardy”, “I’m systematic, you’re unimaginative, he’s rigid”, “I’m alert to teachable moments, you’re spontaneous, he’s disorganized” application of double standards.
    Whistleblowers will get railroaded.

    I see two defenses: (a) freedom of teachers to transfer between schools within a State and a culture of administrative integrity and (b) policies that give to parents the power to determine which institution shall receive the taxpayers’ K-12 education subsidy (vouchers, tuition tax credits, subsidized homeschooling, Parent Performance Contracting).

  4. harriettubmanagenda says:

    I see more use for a study that related credential requirements of new hire teachers to student performance. What coursework by new-hire teachers affects student performance? Do Education credits make any difference? Could elementary schools just as effectively hire Child Psychology majors as Elementary Ed majors? Could high schools just as effectively hire History and Chemistry majors as Secondary Social Studies or Secondary Math Ed majors?

  5. Harriet = me, Malcolm. Sorry.

  6. […] says the foundation’s conclusions were based on the politics of convincing teachers and school districts of the merits of evaluations, and not data. He takes particular aim at classroom observations, which he says the Gates shows do not improve […]

  7. […] Understanding the Gates Foundation’s Measuring Effective Teachers Project (jaypgreene.com) […]

  8. […] Understanding the Gates’ Foundation Measures of Effective Teaching project (Jay P Greene blog,… […]

  9. […] and timely feedback to help teachers improve their practice. (If you want to parse this further, Jay P. Greene is sort of right on this but Marty West has a more nuanced […]

  10. […] Understanding the Gates Foundation’s Measuring Effective Teachers Project […]

  11. […] fostered by traditionalist policies and practices. One can simply look at the arguments between Jay P. Greene, Martin West, myself, and others over the Bill & Melinda Gates Foundation’s efforts to […]

  12. […] data, and student surveys — that it is touting. As both University of Arkansas researcher Jay P. Greene and I have continually pointed out, the Gates Foundation’s own data actually proves without […]

  13. […] We have a number of colleagues who participated in this study including my friends and Center for Teaching Quality teacherpreneurs, Ryan Kinser and Megan Allen. What this study is purported to do is identify effective teaching. What the real goal seemed to be, was to identify an effective way of evaluating teachers. Or according to some, including researcher Jay Greene, find the best mix of measures to implement so that teachers would accept student assessment as a valid to measure teacher effectiveness. Jay said, […]

  14. Mark McLaren says:

    Jay, I don’t understand why you so easily dismiss ‘reliability’. Given a choice between two methods of evaluation which have the same predictive power, wouldn’t teachers prefer the one that was more reliable (i.e., less variable)?

  15. Mark — you are failing to consider the enormous costs (and hassles) involved in adding the classroom observation. Remember that the modest gain in reliability with virtually no improvement in predictive power took hours of training observers, of whom a large portion had to be dismissed. And then each teacher had to be observed multiple times by different external observers. Just imagine what it would take to implement something like this at scale. And for all of this effort and billions of dollars we get what? A measure that is a little more stable but not more predictive.

    • edthinktank says:

      But is not the goal to have petty little dictators dictate? Otherwise we could improve things by using instructional materials and practices that have been proven effective. … But no, instead we get recently produced materials of questionable worth which align to Gates-funded Common Core State Standards. … Check the NAEP math data and observe the statistical significance of the increase in students scoring at “below basic” from 2013 to 2015 at 4th & 8th grade levels. … Gates’ desire to rule from above is an expensive disaster. A fine example of the dominance of Power Politics over rational action.

  16. […] this article from Jay P. Green’s blog, “Understanding the Gates Foundation’s project for effective teacher evaluation,” we can see how far the intentions of certain external agents to manipulate the […]

  17. […] was a monumental waste of money, misinterpreted its own data and couldn’t find any way to identify an effective teacher. And what they came up with would cost $5 billion a year in the USA alone. That’s why people […]
