The Gates Foundation has released the next installment of reports in their Measuring Effective Teachers Project. When the last report was released, I found myself in a tussle with the Gates folks and Sam Dillon at the New York Times because I noted that the study’s results didn’t actually support the finding attributed to it. Vicki Phillips, the education chief at Gates, told the NYT and LA Times that the study showed that “drill and kill” and “teaching to the test” hurt student achievement when the study actually found no such thing.
With the latest round of reports, the Gates folks are back to their old game of spinning their results to push policy recommendations that are actually unsupported by the data. The main message emphasized in the new round of reports is that we need multiple measures of teacher effectiveness, not just value-added measures derived from student test scores, to make reliable and valid predictions about how effective different teachers are at improving student learning.
This is the clear thrust of the newly released Policy and Practice Brief and Research Paper and is obviously what the reporters are being told by the Gates media people. For example, Education Week summarizes the report as follows:
…the study indicates that the gauges that appear to make the most finely grained distinctions of teacher performance are those that incorporate many different types of information, not those that are exclusively based on test scores.
And Ed Sector says:
The findings demonstrate the importance of multiple measures of teacher evaluation: combining observation scores, student achievement gains, and student feedback provided the most reliable and predictive assessment of a teacher’s effectiveness.
But buried away on p. 51 of the Research Paper in Table 16 we see that value-added measures based on student test results — by themselves — are essentially as good or better than the much more expensive and cumbersome method of combining them with student surveys and classroom observations when it comes to predicting the effectiveness of teachers. That is, the new Gates study actually finds that multiple measures are largely a waste of time and money when it comes to predicting the effectiveness of teachers at raising student scores in math and reading.
According to Table 16, student achievement gains correlate with teachers' underlying value-added at .69. If the test scores are combined (with equal weighting) with the results of a student survey and classroom observations that rate teachers according to a variety of commonly used methods, the correlation with underlying value-added drops to between .57 and .61. That is, combining test scores with other measures, where all measures are equally weighted, actually reduces reliability.
The researchers also present the results of a criteria-weighted combination of student achievement gains, student surveys, and classroom observations, based on regression coefficients measuring how predictive each component is of student learning growth in other sections taught by the same teacher. On that basis, test score gains are weighted at .729, the student survey at .179, and the classroom observations at .092. This tells us how much more predictive test score gains are than student surveys or classroom observations. Yet even when test score gains constitute 72.9% of the combined measure, the correlation with underlying teacher quality still ranges between .66 and .72, depending on which method is used for rating the classroom observations. The criteria-weighted combined measure provides essentially no improvement in reliability over using test score gains by themselves.
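To make the arithmetic of a criteria-weighted composite concrete, here is a minimal sketch in Python. The weights are the ones reported in the post (Table 16 of the Research Paper); the function name, the assumption that each component has been standardized, and the example inputs are illustrative, not taken from the MET study itself.

```python
# Sketch of forming a criteria-weighted composite teacher measure.
# Weights come from the post (.729 / .179 / .092); the standardization
# assumption and component names are illustrative, not from the study.

def composite_score(test_gain, student_survey, observation,
                    weights=(0.729, 0.179, 0.092)):
    """Combine three standardized teacher measures into one composite.

    Each input is assumed to be a z-score (mean 0, std 1), so the
    weights determine each component's relative contribution.
    """
    w_test, w_survey, w_obs = weights
    return w_test * test_gain + w_survey * student_survey + w_obs * observation

# Example: a teacher one standard deviation above average on test-score
# gains but exactly average on the survey and observation measures.
print(round(composite_score(1.0, 0.0, 0.0), 3))  # 0.729
```

Because the test-gain weight dominates, the composite tracks the test-score component closely, which is one way to see why the combined measure adds so little predictive power over test scores alone.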
And using multiple measures does not improve our ability to distinguish between effective and ineffective teachers. Using test scores alone, the difference between top-quartile and bottom-quartile teachers in producing student value-added is .24 standard deviations in math learning growth on the state test. If we combine test scores with student surveys and classroom observations using an equal weighting, the difference between top- and bottom-quartile teachers shrinks to between .19 and .21. If we use the criteria weights, where test scores are 72.9% of the combined measure, the gap ranges between .22 and .25. In short, adding measures does not widen the gap between the teachers we identify as effective and ineffective.
The same basic pattern of results holds for reading, as can be seen in Table 20 on p. 55 of the report. Combining test score measures of teacher effectiveness with student surveys and classroom observations does slightly improve our ability to predict how students answer survey items about their effort in school and how they feel about their classroom environment. But unlike test scores, which have been shown to be strong predictors of later life outcomes, I have no idea whether these survey items accurately capture what they intend to or have any importance for students' lives.
Adding the student surveys and classroom observation measures to test scores yields almost no benefit, but it adds an enormous amount of cost and effort to a system for measuring teacher effectiveness. To make the classroom observations usable, the Gates researchers had to have four independent observations of each classroom by four separate people. Put into practice in schools, that would consume an enormous amount of time and money. In addition, administering, scoring, and combining the student survey also has real costs.
So, why are the Gates folks saying that their research shows the benefits of multiple measures of teacher effectiveness when their research actually suggests virtually no benefits to combining other measures with test scores and when there are significant costs to adding those other measures? The simple answer is politics. Large numbers of educators and a segment of the population find relying solely on test scores for measuring teacher effectiveness to be unpalatable, but they might tolerate a system that combined test scores with classroom observations and other measures. Rather than using their research to explain that these common preferences for multiple measures are inconsistent with the evidence, the Gates folks want to appease this constituency so that they can put a formal system of systematically measuring teacher effectiveness in place. The research is being spun to serve a policy agenda.
This spinning of the findings is not just an accident or the result of a misunderstanding. It is clearly deliberate. Throughout the two reports Gates just released, they regularly engage in the same pattern of presenting the information. They show that the classroom observation measures by themselves have weak reliability and validity in predicting effective teachers. But if you add the student survey and then the test score measures, you get much better measures of effective teachers. This pattern of presentation suggests the importance of multiple measures, since the classroom observations are strengthened when other measures are added. The only place you find the reliability and validity of test scores by themselves is at the bottom of the Research Paper, in Tables 16 and 20. If both the lay version and the technical report had consistently shown how little test scores are improved by adding student surveys and classroom observations, it would be plain that test scores alone are just about as good as multiple measures.
The Gates folks never outright misstate their results (as Vicki Phillips did with the previous report). But they are careful to frame the findings as consistently as possible with the Gates policy agenda of pushing a formal system of measuring teacher effectiveness that involves multiple measures. And it worked, since reporters are repeating this inaccurate spin of the findings.
(UPDATE — For a post anticipating responses from Gates, see here.)
This is terrifically interesting and frightening in some sense. Thank you. One perhaps silly question — how is an effective teacher defined?
Hi Michael. For the most part they define effective teachers as those who are able to produce gains in student achievement test results.
The assessments are mostly tasks, activities, or projects. The goals are largely behavioral or attitudinal.
Where is the knowledge? Apart from being the source to point to in gaining the state approvals and to use in developing the learning tasks.
What is being “tested”? The current objective tests are going away, even the weak CRCTs.
An effective teacher is defined as one who implements the old outcomes-based education practitioners' handbook measures in the classroom. That's why Danielson, Pianta, Pecheone, etc. are all involved in the Measuring Effective Teachers project: to incorporate CLASS, the Connecticut OBE criteria, and Danielson's OBE handbook from the 80s.
This is designed to get around the "close the door and teach the content" behavior that the Rand Change Agent study from the 1970s identified as interfering with the 1960s attempt at ed reform. It's not that multiple measures work better. It's that it's important to be able to threaten teachers with the loss of their jobs if they refuse to "perform" as mandated.
The 2 metro Atlanta school districts, Cobb and Fulton, failed to sign up for Race to the Top in part because the MOU mandated by the state stipulated that all classes and professional development must be conducted pursuant to the Learning Frameworks. This had been an issue in the state's piloting of Common Core via its Georgia Performance Standards, especially with respect to the integrated math. The experience with the task approach meant little transmission of math knowledge and skills, and the districts wanted to be free to teach the content in the most productive ways possible.
Within the year, both school districts ended up with new school superintendents who, inexplicably, just happened to have come from 2 of the 7 Council of Great City Schools pilot sites, Charlotte and Dallas, involved in the research for this study.
Education in the US these days is determined to gain submission from schools, teachers, students, and districts one way or another.
I appreciate the insights into the way this research is being spun.
When Melinda Gates was interviewed at the Education Nation Teacher Town Hall, she made it clear that the way the student surveys — and presumably other forms of “multiple measurement” were validated was by the extent to which they correlated with test score gains. I wrote about this at the time here: http://blogs.edweek.org/teachers/living-in-dialogue/2011/09/circular_reasoning_at_the_gate.html
I think your analysis is probably correct. The concept of “multiple measures” is being promoted because it is more palatable than just using test score data. But the goal is to create a mechanical, “data-driven” system for identifying “effective” teachers. Equating good teaching with good test scores is the death of true education.
Thanks for the note. I should clarify that I am not writing all of this about the Gates report because I believe that we should just use test scores to measure teacher effectiveness. My point is simply that the main finding being attributed to this study is not actually supported by the results. And I think the error is attributable to political distortion of the research results.
Unfortunately, this is becoming a pattern with Gates-backed research. In addition to the political distortion of the two MET reports, a bunch of the Common Core research that Gates has backed shows evidence of being made to fit the policy agenda. Given how big Gates is in funding education research, I am finding this pattern of politically-distorted research to be very worrisome.
And, Anthony, you may be surprised to learn that I am also opposed to a mechanical, centrally imposed system for evaluating teachers. There is no “right” way to identify and motivate effective teachers. I’m inclined to believe that when individual schools and principals are properly incentivized, they’ll figure out how best to recruit, retain, motivate, and evaluate their staff.
I find it interesting that you see the idea of multiple measures being promoted merely for political purposes, and to make the idea more "palatable." The people promoting only value-added measures or test data analysis have no connection whatsoever to education. Those who are based in education realize that students are not simply automatons that can be fed information. I'll even bust out the cliché that teaching is an art, not a science. Schools adopting consulting or education firms' "canned" curricula are doing a terrible disservice to both teachers and students. (And those firms have aligned the curriculum with standardized tests, and the students HATE it, almost as much as the teachers do.) There is no standard way to teach each and every student, and it appears many just cannot accept that reality. Sorry. I want to win the lottery, but I haven't, so I will deal accordingly. Advocates trying to implement a one-size-fits-all curriculum and testing mechanism are deluded and should do the same as my lottery-losing self. We are dealing with young people here, not machines. Also, think what education would devolve into if statistical analysis of tests were all that mattered. Already teachers feel pressure to "teach to the test." Why would teachers "waste their time" teaching students to think "outside the box" or come up with creative solutions to problems? People like yourself are a threat to the future of our nation.
You misread my post. I am not advocating for a system based only on test scores, and in the comments I explicitly say so. I am simply for an accurate description of research. And the Gates research found that combining measures did not improve predictive power while costing a ton of money. That doesn't mean we should go ahead with a test-only system. But we shouldn't mislead people and tell them that the combined measure is supported by this research.
Well, hallelujah for a point of agreement. Shall we inform Mr. Russo?
After you caught them misrepresenting the research last time, the media stopped picking up the Gates spin. Most reporters don’t appreciate being used as tools (although some enjoy the proximity to power it creates – as you found out last time). It will be interesting to see if the same happens this time.
Since Gates is so heavily invested in the idea that scientific research can determine correct policy, perhaps you and I could co-author a study on the relative cost-effectiveness of the Gates approach versus our approach in influencing policy. They've spent how many millions on this, and you and I and a few other folks have pitched in what, the domain name fee and some in-kind man-hours? And yet you're like their Belloq. "Once again, Mr. Gates, we see there is no media hit you can possess which I cannot take away."
Do you hypothesize the low correlation of observations is a likely ceiling going forward, since several were tried?
Or would you guess that the observation protocols look at the wrong stuff, setting up the chance of a Moneyball-like breakthrough, where entirely different folks (maybe amateurs) figure out what really matters in an observation?
I think there is a distinct possibility that Gates just messed up the implementation of the classroom observations somehow. Another study by Rockoff and Speroni (see http://blogs.edweek.org/edweek/teacherbeat/10%2017%20rockoff_speroni_labour_econ_published.pdf ) found that the classroom observations independently added a substantial amount to the predictive power of student achievement scores. Perhaps the Rockoff and Speroni study looked at an unusually good classroom observation system. Perhaps the disappointing Gates classroom observation results are more generally representative. All I know is that the Gates study didn’t support the main conclusion that they and others are attributing to it.
Thanks Jay. I will read the Rockoff/Speroni study.
I am hesitant to stick my head into the guillotine of socio-intellectual theorizing, but aren't we missing the core issue completely here? That being the fact that our youth are being exposed to the politically correct agenda of a profession progressing (?) from a humanitarian art into an instrument of suspect purpose for the utility of a vaguely defined movement of masses of literal worth into a veiled area of exclusivity of the highest order.
It's disingenuous to ignore, or dodge, the fact that the very people we need to be concerned about are mentioned mostly as an afterthought: our children. How can we continue to ignore the reality of the falling levels of academic attainment of every generation of students since the early sixties, while we numbly debate the state of non-teaching as if the teachers themselves were the primary concern?
Pop philosophy, which is the root of the stinking tree from which this polemic monstrosity grows, as entertaining as it may be, is the horrid haunt of self-centered egos fanning their peacock tails of pseudo-wisdom for their own aggrandizement, while the innocent subjects we submit to their care waste their most formative years struggling to accommodate the brainfarts of one foolish experiment after another in how best to covertly indoctrinate society's most susceptible fraction.
It's nearly nauseating to read the list of "scholars" and "studies" being quoted in this discussion, while ignoring their failures to achieve anything even resembling the results painstakingly acquired by the trial and error of sixteen generations of real American teachers. Has it not occurred to anybody to say, hey, wait a minute, how could the seventh graders of fifty years ago outscore today's high school graduates in English, in Math, in World History, in Geography, in languages!
I've read papers from those years submitted by real students in real classrooms, under the tutelage of real teachers, and have been floored by the literacy, the efforts at precision, the degree of relative subject difficulty (relative to today), and the obvious and outright achievement reached by thirteen-year-olds that could actually intimidate some of today's very teachers.
When we allow ourselves to be a reflection of academia, instead of the other way round, we will by any intelligent subjective measure inexorably revert to our lowest common denominator. It's inevitable; it's a simple linear inevitability.
In short, we are a poor substitute for the studied excellence of what the teaching profession used to be, the same profession that once made our achievements in so many fields of expertise the envy of the entire educated world. And we will not regain that leadership under the questionable auspices of whatever Gates is fostering. We will only recover if we reopen our eyes to the fact that we are in an actual state of educational regression and act accordingly, and let the Goddamned social bullchips fall wherever they may.
Bitter? Me? Nah, I kind of passed that low-level passion some time ago. I've now reached the stage, along with most other Americans, of wanting to go and urinate on the carpets our dollars have put down in the living rooms of the "experts" this forum is so quick to lionize. It would freshen up their whole house, I'm quite sure.
I am going to use multiple measures to evaluate the accuracy of this post.
The fact that “student achievement gains”, by themselves, are highly predictive of “underlying teacher value-added” would seem to be tautological, since teacher value-added is calculated primarily based on, you guessed it, student achievement gains. How did the researchers measure teacher value-added distinct from student achievement gains? We really need to know that in order for your critique to make any sense.
They tested the predictive power of student achievement gains by using the previous year to predict the next one or by using one section (like 3rd period) to predict another (like 4th period).
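A minimal sketch of that section-to-section check, in Python: compute the correlation across teachers between value-added estimates from one class section and estimates from another. All names and numbers below are made up for illustration; they are not MET data.

```python
# Illustrative sketch of the split-sample reliability check described
# above: correlate teachers' achievement gains from one class section
# with their gains from another. The data here are invented.

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-teacher achievement gains (in SD units) for two sections.
section_3rd = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
section_4th = [0.0, -0.1, 0.1, 0.2, -0.2, 0.1]

print(round(pearson(section_3rd, section_4th), 2))  # 0.62
```

A high cross-section (or year-to-year) correlation is what the researchers treat as evidence that a measure captures something stable about the teacher rather than noise from a particular roster of students.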
Minnesota Kid hits the nail squarely on the head. As I pointed out in the piece I wrote about what Melinda Gates said at Education Nation, the entire value-added project assumes that student achievement as indicated by test scores is synonymous with student learning. This then becomes circular when “multiple measures” are actually other indicators that likewise correlate with test score gains. These are NOT truly multiple measures of good teaching. They are simply multiple ways of measuring the ability to generate good test scores. And everyone SHOULD realize that there are good ways to teach that result in better test scores, but also very bad ways to teach that likewise raise scores. Defining good teaching on the basis of test scores — directly or indirectly — does nothing to differentiate between these two alternatives.
To really judge the quality of teaching, we need a much more nuanced and robust set of indicators — along the lines developed by the National Board of Professional Teaching Standards almost two decades ago.
Thanks Anthony. That last paragraph is quite helpful.
You look at the question of whether classroom observations are predictive. What is your view on whether classroom observations help teachers become better teachers? If observations and feedback help them become better teachers does that then raise future test scores and value added?
Hi Evan — The Gates study produced no evidence that classroom observations could be helpful in providing feedback to teachers about what to do to improve student learning gains. See https://jaypgreene.com/2012/01/09/anticipating-responses-from-gates/ . I’m not saying that classroom observations couldn’t ever provide helpful feedback. I’m just saying that the classroom observation in the Gates study wasn’t helpful.
I think it is interesting how the debate slides back into an almost unconscious acceptance of the definition of “useful” and “effective.” These terms are being defined solely upon the basis of effect on test scores. “Student learning gains” is a fancy way to describe standardized test scores. And that is what teacher quality has been reduced to here. Every other indicator is being vetted by its alignment to the capacity to raise test scores. What if some other dimension increased a student’s capacity to think critically — but critical thinking was not measured by the test? What if a teacher was able to make a student feel part of a classroom community, and that student became accepted by peers for the first time in his life? What if a teacher tapped into a student’s talent as an artist, as I did more than once, and got that student engaged in school and science in a new way? If these acts are not reflected in student test scores, they are never considered by the great metricians who are erecting these mighty evaluative mechanisms. Once we equate “student learning” with “test score gains” we have lost at least half of what it means to be a good teacher.
What’s your definition of critical thinking?
So often it is assumed to be a synonym for analytical thinking, but many of the reform documents I have read want the casual reader to assume that meaning. A close reading, though, makes it clear that many use it to mean an awareness of social injustices.
How would you define it for purposes of assessing effective teaching?
Also, when you speak of engagement, aren't you pushing more of an emotional interest than a cognitive one?
In a heterogeneous classroom, doesn't that mean the least-interested student dictates what will be accessible to anyone, hurting the bright students with good attention spans the most?
I would define critical thinking as that which engages the student in thinking and problem solving. As a science teacher, I want my students to wrestle with understanding scientific concepts, but beyond that I want them to be able to analyze an experiment. Do they have a good experimental design? Have they looked at the variables that might affect the outcome and isolated them in their experiment? Can they explain their results clearly? Can they integrate their observations into a working model that relates to what other scientists have discovered?
Sometimes social injustices may provide us with an avenue through which we might engage students who feel disenfranchised, but that is not what I meant by critical thinking.
In terms of heterogeneous classrooms, my experience has been that when I challenged my students to think for themselves, the highly motivated students are the first to appreciate this opportunity. Ideally, their less motivated peers begin to catch the spirit as well.
I think critical thinking is simply linking, making connections, synthesizing from the given information in overlapping or even isolated areas. When you can do that, you're able to do everything Cody talks about.
Also, what good does it do for a middle or high school teacher to become an elementary principal and observe classrooms? They often do not know what to look for. Sorry. The system is upside down and very, very broken at the bureaucratic and administrative level. Including Gates, another know-it-all because he's rich. Well, Microsoft hasn't kept up the innovation, the creativity, or the grand future it seemed to promise. Steve Jobs was better at that. Even George Lucas is doing more for education than Gates, who has become another blue suit.
[…] How the Gates Foundation Spins its Research is by Jay Greene. I’m particularly impressed by the dialogue in the comments. […]
I have recently been presenting the MET findings in RTTT workshops with school leaders in Upstate New York. Your article is very interesting and appreciated.