The Gates Foundation has released the next installment of reports from its Measures of Effective Teaching (MET) Project. When the last report was released, I found myself in a tussle with the Gates folks and Sam Dillon at the New York Times because I noted that the study’s results didn’t actually support the finding attributed to them. Vicki Phillips, the education chief at Gates, told the NYT and LA Times that the study showed that “drill and kill” and “teaching to the test” hurt student achievement, when the study actually found no such thing.
With the latest round of reports, the Gates folks are back to their old game of spinning their results to push policy recommendations that are actually unsupported by the data. The main message emphasized in the new round of reports is that we need multiple measures of teacher effectiveness, not just value-added measures derived from student test scores, to make reliable and valid predictions about how effective different teachers are at improving student learning.
This is the clear thrust of the newly released Policy and Practice Brief and Research Paper and is obviously what the reporters are being told by the Gates media people. For example, Education Week summarizes the report as follows:
…the study indicates that the gauges that appear to make the most finely grained distinctions of teacher performance are those that incorporate many different types of information, not those that are exclusively based on test scores.
And Ed Sector says:
The findings demonstrate the importance of multiple measures of teacher evaluation: combining observation scores, student achievement gains, and student feedback provided the most reliable and predictive assessment of a teacher’s effectiveness.
But buried away in Table 16 on p. 51 of the Research Paper, we see that value-added measures based on student test results, by themselves, are essentially as good as or better than the far more expensive and cumbersome approach of combining them with student surveys and classroom observations when it comes to predicting teacher effectiveness. That is, the new Gates study actually finds that multiple measures are largely a waste of time and money when the goal is predicting how effective teachers are at raising student scores in math and reading.
According to Table 16, student achievement gains correlate with teachers’ underlying value-added at .69. If the test scores are combined, with equal weighting, with the results of a student survey and with classroom observations rated according to a variety of commonly used methods, the correlation with underlying value-added drops to between .57 and .61. That is, combining test scores with other measures, all equally weighted, actually reduces reliability.
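To see why equal weighting can lower reliability, here is a minimal simulation of my own, not the MET data: each measure is modeled as latent teacher value-added plus noise, with noise variances chosen as illustrative assumptions so that test scores alone correlate about .69 with the latent trait. Averaging in two noisier measures drags the composite’s correlation down into the neighborhood of the .57–.61 range reported in Table 16.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Latent teacher value-added (standardized). The noise variances below are
# illustrative assumptions, not MET estimates; the test-score noise is set
# so that measure alone correlates ~.69 with the latent trait.
quality = rng.normal(size=n)
test    = quality + rng.normal(scale=np.sqrt(1.1), size=n)  # corr ~ .69
survey  = quality + rng.normal(scale=np.sqrt(6.0), size=n)  # noisier measure
observe = quality + rng.normal(scale=np.sqrt(9.0), size=n)  # noisiest measure

# Equal-weight composite: each component counts one-third.
equal_weight = (test + survey + observe) / 3

print(round(np.corrcoef(quality, test)[0, 1], 2))          # ~0.69
print(round(np.corrcoef(quality, equal_weight)[0, 1], 2))  # ~0.60
```

The arithmetic behind the drop is simple: averaging a reliable measure with two noisy ones dilutes the signal, so the composite tracks true quality less closely than the best single measure does.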
The researchers also present the results of a criteria-weighted combination of student achievement gains, student surveys, and classroom observations, with the weights taken from regression coefficients measuring how predictive each component is of the same teacher’s student learning growth in other sections. Under this scheme, test score gains are weighted at .729, the student survey at .179, and the classroom observations at .092, which tells us how much more predictive test score gains are than student surveys or classroom observations. Yet even when test score gains constitute 72.9% of the combined measure, the correlation with underlying teacher quality still ranges between .66 and .72, depending on which method is used to rate the classroom observations. The criteria-weighted combined measure provides basically no improvement in reliability over using test score gains by themselves.
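Applying the paper’s reported criteria weights (.729, .179, .092) to the same sort of simulated measures shows the same pattern: the composite’s correlation with latent quality lands close to what test scores achieve alone. The noise variances here are again my illustrative assumptions, not MET estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Latent teacher value-added plus three noisy measures of it.
# Noise variances are illustrative assumptions, not MET estimates.
quality = rng.normal(size=n)
test    = quality + rng.normal(scale=np.sqrt(1.1), size=n)
survey  = quality + rng.normal(scale=np.sqrt(6.0), size=n)
observe = quality + rng.normal(scale=np.sqrt(9.0), size=n)

# Criteria weights reported in the Research Paper: test gains dominate.
composite = 0.729 * test + 0.179 * survey + 0.092 * observe

print(round(np.corrcoef(quality, test)[0, 1], 2))       # ~0.69
print(round(np.corrcoef(quality, composite)[0, 1], 2))  # ~0.73
```

Under these assumptions the criteria-weighted composite edges out test scores alone only slightly (~.73 vs. ~.69), mirroring the paper’s finding that the weighted combination buys essentially no extra reliability.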
Nor does using multiple measures improve our ability to distinguish between effective and ineffective teachers. Using test scores alone, the difference between top-quartile and bottom-quartile teachers in producing student value-added is .24 standard deviations of math learning growth on the state test. If we combine test scores with student surveys and classroom observations using equal weighting, that gap shrinks to between .19 and .21. If we use the criteria weights, where test scores are 72.9% of the combined measure, the gap ranges between .22 and .25. Either way, the multiple-measure composites do no better than test scores alone at separating the most effective teachers from the least.
The same basic pattern of results holds for reading, as can be seen in Table 20 on p. 55 of the report. Combining test score measures of teacher effectiveness with student surveys and classroom observations does slightly improve our ability to predict how students would answer survey items about their effort in school and how they felt about their classroom environment. But unlike test scores, which have been shown to be strong predictors of later life outcomes, I have no idea whether these survey items accurately capture what they intend to or have any importance for students’ lives.
Adding student surveys and classroom observation measures to test scores yields almost no benefit, but it adds an enormous amount of cost and effort to a system for measuring teacher effectiveness. To make the classroom observations usable, the Gates researchers needed four independent observations of each classroom by four separate people. Put into practice in schools, that would consume an enormous amount of time and money. In addition, administering, scoring, and combining the student survey also has real costs.
So why are the Gates folks saying that their research shows the benefits of multiple measures of teacher effectiveness, when it actually suggests virtually no benefit to combining other measures with test scores and significant costs to adding them? The simple answer is politics. Large numbers of educators and a segment of the population find relying solely on test scores to measure teacher effectiveness unpalatable, but they might tolerate a system that combined test scores with classroom observations and other measures. Rather than using their research to explain that this common preference for multiple measures is inconsistent with the evidence, the Gates folks want to appease this constituency so that they can put a formal system for measuring teacher effectiveness in place. The research is being spun to serve a policy agenda.
This spinning of the findings is not just an accident or the result of a misunderstanding. It is clearly deliberate. Throughout the two reports Gates just released, the same pattern of presentation recurs. They show that classroom observation measures by themselves have weak reliability and validity in predicting effective teachers, but that adding the student survey and then the test score measures yields much better measures of effective teachers. This ordering suggests the importance of multiple measures, since the classroom observations are strengthened when other measures are added. The only place you find the reliability and validity of test scores by themselves is at the bottom of the Research Paper, in Tables 16 and 20. If both the lay and technical reports had consistently shown how little test scores are improved by adding student surveys and classroom observations, it would be plain that test scores alone are just about as good as multiple measures.
The Gates folks never describe their results outright inaccurately (as Vicki Phillips did with the previous report). But they are careful to frame the findings as consistently as possible with the Gates policy agenda of pushing a formal system of measuring teacher effectiveness built on multiple measures. And it worked, since reporters are repeating this misleading spin of the findings.
(UPDATE — For a post anticipating responses from Gates, see here.)