What Success Would Have Looked Like

Yesterday I described the Gates Foundation’s Measures of Effective Teaching (MET) project as “an expensive flop.”  To grasp just what a flop the project was, it’s important to consider what success would have looked like.  If the project had produced what Gates was hoping for, it would have found that classroom observations were strong, independent predictors of other measures of effective teaching, like student test score gains.  Even better, the combination of classroom observations, student surveys, and previous test score gains would have been a much better predictor of future test score gains (or of future classroom observations) than any one of those measures alone.  Unfortunately, MET failed to find anything like this.

If MET had found classroom observations to be strong predictors of other indicators of effective teaching and if the combination of measures were a significantly better predictor than any one measure alone, then Gates could have offered evidence for the merits of a particular mixing formula or range of mixing formulas for evaluating teachers.  That evidence could have been used to good effect to shape teacher evaluation systems in Chicago, LA, and everywhere else.

They also could have genuinely reassured teachers anxious about the use of test score gains in teacher evaluations.  MET could have allayed those concerns by telling teachers that test score gains produce information that is generally similar to what is learned from well-conducted classroom observations, so there is no reason to oppose one and support the other.  What’s more, significantly improved predictive power from a mixture of classroom observations with test score gains could have made the case for why we need both.

MET was also supposed to have helped us adjudicate among several commonly used rubrics for classroom observations so that we would have solid evidence for preferring one approach over another.  Because MET found that classroom observations in general are barely related to other indicators of teacher effectiveness, the study told us almost nothing about the criteria we should use in classroom observations.

In addition, the classroom observation study was supposed to help us identify the essential components of effective teaching.  That knowledge could have informed improved teacher training and professional development.  But because MET was a flop (because classroom observations barely correlate with other indicators of teacher effectiveness and fail to improve the predictive power of a combined measure), we haven’t learned much of anything about the practices that are associated with effective teaching.  If we can’t connect classroom observations with effective teaching in general, we certainly can’t say much about which of the observed aspects of teaching contributed most to effective teaching.

Just so you know that I’m not falsely attributing to MET these goals that failed to be realized, look at this interview from 2011 of Bill Gates by Jason Riley in the Wall Street Journal.  You’ll clearly see that Bill Gates was hoping that MET would do what I described above.  It failed to do so.  Here is what the interview revealed about the goals of MET:

Of late, the foundation has been working on a personnel system that can reliably measure teacher effectiveness. Teachers have long been shown to influence students’ education more than any other school factor, including class size and per-pupil spending. So the objective is to determine scientifically what a good instructor does.

“We all know that there are these exemplars who can take the toughest students, and they’ll teach them two-and-a-half years of math in a single year,” he says. “Well, I’m enough of a scientist to want to say, ‘What is it about a great teacher? Is it their ability to calm down the classroom or to make the subject interesting? Do they give good problems and understand confusion? Are they good with kids who are behind? Are they good with kids who are ahead?’

“I watched the movies. I saw ‘To Sir, With Love,’” he chuckles, recounting the 1967 classic in which Sidney Poitier plays an idealistic teacher who wins over students at a roughhouse London school. “But they didn’t really explain what he was doing right. I can’t create a personnel system where I say, ‘Go watch this movie and be like him.’”

Instead, the Gates Foundation’s five-year, $335-million project examines whether aspects of effective teaching—classroom management, clear objectives, diagnosing and correcting common student errors—can be systematically measured. The effort involves collecting and studying videos of more than 13,000 lessons taught by 3,000 elementary school teachers in seven urban school districts.

“We’re taking these tapes and we’re looking at how quickly a class gets focused on the subject, how engaged the kids are, who’s wiggling their feet, who’s looking away,” says Mr. Gates. The researchers are also asking students what works in the classroom and trying to determine the usefulness of their feedback.

Mr. Gates hopes that the project earns buy-in from teachers, which he describes as key to long-term reform. “Our dream is that in the sample districts, a high percentage of the teachers determine that this made them better at their jobs.” He’s aware, though, that he’ll have a tough sell with teachers unions, which give lip service to more-stringent teacher evaluations but prefer existing pay and promotion schemes based on seniority—even though they often end up matching the least experienced teachers with the most challenging students.

The final MET reports produced virtually nothing that addressed these stated goals.  But in Orwellian fashion, the Gates folks have declared the project to be a great success.  I never expected MET to work because I suspect that effective teaching is too heterogeneous to be captured well by a single formula.  There is no recipe for effective teaching because kids and their needs are too varied, teachers and their abilities are too varied, and the proper matching of student needs and teacher abilities can be accomplished in many different ways.  But this is just my suspicion.  I can’t blame the Gates Foundation for trying to discover the secret sauce of effective teaching, but I can blame them for refusing to admit that they failed to find it.  Even worse, I blame them for distorting, exaggerating, and spinning what they did find.

(edited for typos)

13 Responses to What Success Would Have Looked Like

  1. Jay, my analysis of the MET reports and my dealings with several district superintendents who came from MET districts suggest that the purpose of the report was to force teachers to use Charlotte Danielson’s OBE-oriented practices in the classroom or be deemed ineffective. The Danielson framework also aligns with Linda Darling-Hammond’s recent SCOPE report on Teacher Effectiveness and the InTASC Model Teaching Standards.

    One of the greatest snow jobs in education right now is anyone saying that Common Core is not telling teachers how to teach. Changing the type of pedagogy used in the classroom and getting performance assessments instead of tests of knowledge are the real purposes of the Common Core campaign, as the Hewlett Foundation, for one, has explicitly acknowledged.

  2. Mike G says:

    Jay, what in your view would be a successful “evaluation rubric correlation” with VAM?

    I.e., I think MET found something in the 0.2 range, depending on details.

    What would you consider “fairly good?”

  3. Mark D says:

    I was interested to note this finding in the ‘Gathering Feedback’ report (page 30): for all five classroom observation tools the team analyzed, ‘the first principal component was simply an equal-weighted average of all competencies.’

    When shorn of its statistical weight, this statement is saying that the tools do not tell us what makes good teachers good, which was one of the aims of the MET effort. Good teachers are good at everything.

    The report goes on to say the second component is about classroom and student behavior management, which will surprise nobody. But the report notes that the first component accounted for much of the variation in scores, so this second one is dominated by the first.

    I don’t subscribe to conspiracy theories about the Foundation spending enormous amounts so that particular individuals or organizations benefit. But the observation tools are being handled here with kid gloves for unclear reasons.

  4. Hi Mike G — Correlations of .2 describe very weak relationships. Correlations around .4 or .5 begin to be more serious.

    But Mark D makes an excellent point. For classroom observations to be used to identify more effective teaching practices, we need to see a stronger relationship between classroom observations and test score gains, but we also need to see that some sub-components of the classroom observation are much stronger predictors than others. If an equal weighting of all components in the classroom observation is best, then we don’t know which particular things a teacher should do to be better. Good teachers would then just be good at everything observed and our only advice to less good teachers would be to tell them to do everything better. We couldn’t tell them which things to emphasize to be better.

  5. Mike G says:

    Good point Mark D. So what’s your take, in the end, on the observation tools?

  6. Mark D says:

    I posted this comment on Education Next about observations. I admit that taking the issue back to the drawing board sounds weak.

    “The focus on regression relationships between test scores and observation scores overlooks another dimension of observation scores. They reflect other aspects of what educators might think of as ‘good teaching.’ Even if observation scores are completely uncorrelated with test score gains, parents and communities might prefer that teachers score well on them. It is evidence that ‘good teaching’ is happening by the criteria used to create the observational tool. Or, put another way, if teachers were generating high test score gains from their students by creating a climate of abject fear in their classrooms, their observation scores should be low and that information is useful.

    This is not to defend or criticize the MET results, but to point out that the MET study tried to validate observation scores by examining how well they predicted test scores, which equates teaching with test scores. Not finding much correlation beyond what pre-tests already tell us, we could dismiss observations as pointless and expensive as Jay has argued here. But test score gains plus observation scores might be a better look at ‘good teaching.’ It takes us back to the starting point of trying to define the outcome we want, but just because the MET study did not validate observation scores should not mean that teaching and test scores now are equivalent.”

    And, I should add, cost-effectiveness is not often remarked on and I’m glad Jay brought it up. The question is how much we are willing to pay for the added information about teaching that observations yield. A billion a year is nontrivial, and test scores already are being gathered so costs of observation are being added to costs of testing.

    • This is a good point. You should also submit to Ed Next the comment about the equal weighting of classroom observation sub-components. Observations can’t be used for diagnostic purposes to improve teaching practice if all we can tell people is that they should be better across the board.

  7. jimwis says:

    Unfortunately, the author of this blog fails to mention that the Gates study relies on score gains on standardized tests to compare to other measures in order to test for reliability. With only a handful of classroom observations, it’s absurd to think that we can figure out what makes a teacher who ‘grows’ standardized test scores from year to year tick. In fact, having taught for decades, I have found it abundantly clear that the teachers who increase test scores may have unethical access to the tests themselves and have the ability to coach and prep their kids. I have seen this too many times, and I will spare the details in order to not tempt any wandering eyes out there about how it’s done. This reliance on test scores, as if they were prophetic predictors of teacher quality, is ridiculous. Even when teachers don’t cheat (i.e., teach to the test, coach, prep, etc.), in many cases test scores can vary simply because of the placement of students. I have noticed that when I am luckily assigned a group of students in a class that work together well, and they have the ability to work well in groups and teach each other, then those kids do better on the test. This is no credit to my service as a teacher – I just got lucky and ended up with an extraordinary group of kids and the crackheads were luckily assigned to the teacher down the hall. This love affair with testing outcomes must stop!

  8. Mark D says:

    Any discussion ridiculing the ‘love affair with testing outcomes’ should describe the alternative being proposed in its place. The status quo is not proving adequate for improving education outcomes, and I am not referring just to scores; dropout rates have not improved in the last forty years. Policymakers representing the public have spoken clearly through legislation that education is not going back to its old ways of rating teachers, where 98 percent are deemed to perform satisfactorily. However, jimwis’s point that events in schools can be confused with a teacher’s actual effect on his or her students’ test scores is a useful caution. There is a difficult measurement challenge here.

  9. Frank says:

    Is anyone noting that many teachers are only observed one or two times per year (with one or both observations being announced – or known to the teacher in advance)? Are any of the districts in this study doing sustained observations of the teachers as opposed to the normal practice of once or twice? I don’t know the answer – that’s why I am asking.
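
To put the correlation sizes discussed in comments 2 and 4 in perspective, here is a minimal sketch (not taken from the MET reports) of the standard conversion from a correlation r to the share of variance explained, r squared. The function name is hypothetical, and the three values are simply the ones mentioned in the thread.

```python
def variance_explained(r: float) -> float:
    """Share of variance in one measure accounted for by a linear
    relationship with correlation r (the coefficient of determination)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Correlation sizes mentioned in the comment thread (illustrative only).
for r in (0.2, 0.4, 0.5):
    share = variance_explained(r)
    print(f"r = {r:.1f} -> about {share:.0%} of variance explained")
```

On this arithmetic, a correlation of .2 leaves roughly 96 percent of the variation in one measure unaccounted for by the other, while .5 leaves about 75 percent.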

