
(Guest Post by Matthew Ladner)
GI’s Butcher debated Andrew Morrill, President of the Arizona Education Association, on Education Savings Accounts. Check it out.


It’s really frustrating, but some reporters continue to misrepresent the scholarly literature on the effects of private school choice programs. We devoted an entire chapter in Education Myths to debunking “The Inconclusive Research Myth.” But like an undead vampire that won’t die even after you’ve driven a stake through its heart, reporters keep repeating as fact things like the following:
Studies have generally found no clear advantage in academic achievement for students attending private
schools with vouchers.
That statement was the conclusion of the famously unreliable and partisan Center on Education Policy. And reporter Tom Toch embraced it as an accurate summary of voucher research in his recent article in the Kappan. What do we have to do to stop reporters from repeating this falsehood?
This blog post from Adam Emerson at the newly launched Fordham blog, Choice Words, is a great start. Here’s a taste:
School voucher critics generally approach their job reviewing the research on school choice with unfair assumptions, and otherwise insightful commentators risk recycling old canards. This is true with Thomas Toch’s critique of vouchers in the newest edition of Kappan, which concludes that voucher programs haven’t shown enough impact to justify their position in a large-scale reform effort. Questions of scale can lead to legitimate debate, but we’ll get nowhere until we acknowledge what’s in the literature.
And Adam doesn’t even reference all of the gold standard (random assignment) research showing positive effects for students who participate in voucher programs, not to mention all of the rigorous studies finding that entire school systems improve in response to vouchers.
So why do people like Tom Toch, who’s not stupid or mean, fail to acknowledge this wealth of evidence showing benefits from voucher programs and just focus on crappy and mistaken summaries from hacks at CEP?

(Guest Post by Matthew Ladner)
A few months ago, I provided a quick analysis of DCPS NAEP scores under Michelle Rhee. Having looked into the fine details, I believe that I underestimated the positive trend in DCPS reading scores during the 2007-2011 period.
NAEP has long faced a tricky issue: inclusion rates for special education and English language learner students vary between jurisdictions. In 2011, NAEP adopted inclusion rate standards for ELL and SD students, and notified readers in an appendix of jurisdictions that violated those standards.
Some states and jurisdictions made far more successful efforts to comply with these standards than others. As you can see from the figure below, DC would have been far out of compliance with these standards (had they been in place) during the 1990s and (especially) in 2007. In 2007, DCPS had excluded nearly three times as many students as permissible under the 2011 standards.
So in 2007, DCPS officials excluded 14% of students from 4th Grade NAEP testing; by 2011 that figure had fallen to 3% (the 2011 standard called for including 95% of all students). In 2007, DCPS stood far out of compliance, but it came well within compliance in 2011. This is all well and good, except that it complicates our ability to assess the recent history of DC NAEP gains.
In order to get a clearer picture of this, I decided to run 4th Grade NAEP scores for students outside of ELL or special education programs. This should minimize the impact of inclusion policy changes. Examined in this fashion, you get the following results:
Recall that the unadjusted total scores for 4th grade reading jumped from 197 in 2007 to 202 in 2009 but dropped back a point to 201 in 2011. That is a four point gain in four years, which ranks in meh territory. Given Figure 1 above, I am not exactly inclined to trust those scores, and in fact our second table tells quite a different story: general education students in DC made a 10 point gain between 2007 and 2011 on 4th grade reading. Ten points approximately equals a grade level worth of progress, so it is fair to say that DCPS general education 4th graders were reading approximately as well as 2007 general education 5th graders. Ten points ranks as the largest reading gain in the nation during this period for these students. Mind you, a 209 score for non-ELL and non-special ed students is still terribly low. Only gains will get DC out of the cellar, however, and DC banked solid gains during this period.
If you combine 4th and 8th grade reading gains for general education students, and only look at Free and Reduced lunch eligible students for a bit of socio-economic apples to apples, here is what you find:
DC students had the largest general education 4th grade reading gains in the country and tied for first in the combined 4th and 8th grade reading gains. The District of Columbia, in short, made very substantial reading gains during the 2007-2011 period.
(Guest post by Greg Forster)
Since we’re so deep into the subject of value-added testing and the political pressures surrounding it, I thought I’d point out this recently published study tracking two and a half million students from a major urban district all the way to adulthood. (HT Whitney Tilson)
They compare teacher-specific value added on math and English scores with eventual life outcomes, and apply tests to determine whether the results are biased either by student sorting on observable variables (the life outcomes of their parents, obtained from the same life-outcome data) or unobserved variables (they use teacher switches to create a quasi-experimental approach).
Finding?
Students assigned to high-VA teachers [i.e. teachers who produce high “value added” on test scores] are more likely to attend college, attend higher-ranked colleges, earn higher salaries, live in higher SES neighborhoods, and save more for retirement. They are also less likely to have children as teenagers. Teachers have large impacts in all grades from 4 to 8.
Let’s bring that down to reality:
Replacing a teacher whose VA is in the bottom 5% with an average teacher would increase students’ lifetime income by more than $250,000 for the average classroom in our sample.
But here’s what I want to pay the most attention to. Note the careful wording of the conclusion:
We conclude that good teachers create substantial economic value and that test score impacts are helpful in identifying such teachers.
Note what they don’t say. They don’t say that increasing math and English test scores by itself leads to improved life outcomes. They say good teachers lead to improved life outcomes, and value-add is one relatively good way to identify good teachers.
You’ve heard the saying that the map is not the territory? (If not, that means you haven’t seen Ronin, in which case shame on you.) Well, it’s true. What raises life outcomes is good teaching, and good teaching can’t be reduced to test scores. (See here, here, here, here, here, here, here and here.)
But if you want to find your way around the territory, you need a map. If you want to help those kids stuck with lousy teachers who are out a quarter million, you’re going to need a tool that identifies them. Value added analysis is the best tool we’ve come up with yet – other than parental choice, of course.
And where the tests are freely selected and voluntarily adopted by schools, the tests provide helpful data for parents, so parent choice is strengthened by voluntary testing. That’s why over 90% of private schools use testing in some form. On the other hand, forcing teachers to use a test they don’t believe in is a self-defeating proposal.
But how do you get schools to want to use a test? Parent choice, of course! Choice is what creates the external standard of performance that makes assessment tools seem legitimate rather than illegitimate. So testing and choice are like chocolate and peanut butter – they’re two great tastes that taste great together.
Over the weekend I posted about how I thought the Gates Foundation was spinning the results of their Measuring Effective Teachers Project to suggest that the combination of student achievement gains, student surveys, and classroom observations was the best way to have a predictive measure of teacher effectiveness. Let me anticipate some of the responses they may have:
1) They might say that they clearly admit the limitations of classroom observations and therefore are not guilty of spinning the results to inflate their importance. They could point to p. 15 of the research paper in which they write: “When value-added data are available, classroom observations add little to the ability to predict value-added gains with other groups of students. Moreover, classroom observations are less reliable than student feedback, unless many different observations are added together.”
Response: I said in my post over the weekend that the Gates folks were careful so that nothing in the reports is technically incorrect. The distortion of their findings comes from the emphasis and manner of presentation. For example, the summary of findings in the research paper on p. 9 states: “Combining observation scores with evidence of student achievement gains and student feedback improved predictive power and reliability.” Or the “key findings” in the practitioner brief on p. 5 say: “Observations alone, even when scores from multiple observations were averaged together, were not as reliable or predictive of a teacher’s student achievement gains with another group of students as a measure that combined observations with student feedback and achievement gains on state tests.” Notice that these summaries of the results fail to mention the most straightforward and obvious finding: classroom observations are really expensive and cumbersome and yet do almost nothing to improve the predictiveness of student achievement-based measures of teacher quality.
And the proof that the results are being spun is that the media coverage uniformly repeats the incorrect claim that multiple measures are an important improvement on test scores alone. Either all of the reporters are lousy and don’t understand the reports or the reporters are accurately repeating what they are being told and what they overwhelmingly see in the reports. My money is on the latter explanation.
And further proof that the reporters are being spun is that Vicki Phillips, the Gates education chief, is quoted in the LA Times coverage mischaracterizing the findings: “Using these methods to evaluate teachers is ‘more predictive and powerful in combination than anything we have used as a proxy in the past,’ said Vicki Phillips, who directs the Gates project.” This is just wrong. As I pointed out in my previous post, the combined measure is no more predictive than student achievement by itself.
Lastly, the standard for fair and accurate reporting of results is not whether one could find any way to show that technically the description of findings is not false. We should expect the most straightforward and obvious description of findings emphasized. With the Gates folks I feel like I am repeatedly parsing what the meaning of the word “is” is. That’s political spin, not research.
2) They might say that classroom observations are an important addition because at least they provide diagnostic information about how teachers can improve, while test scores cannot.
Response: This may be true, but it is not a claim supported by the Gates study. They found that all of the different classroom observation methods they tried had very weak predictive power. You can’t provide a lot of feedback about how to improve student achievement based on instruments that are barely correlated with gains in student achievement. In addition, they were unable to find sub-components of the classroom observation methods that were more predictive, so they can’t tell teachers to focus on particular practices on the grounds that those practices are more strongly related to student learning gains. Lastly, it is simply untrue that test scores cannot be diagnostic. There are sub-components of the tests that measure learning in different aspects of the subject. Teachers could be told to emphasize more those areas on which their students have lagged.
3) They may say that classroom observations and students surveys improve the reliability of a teacher quality measure when combined with test scores.
Response: An increase in reliability is cold comfort for a lack of predictive power. Reliability is just an indicator of how consistent a measure is. There are plenty of measures that are very consistent but not helpful in predicting teacher quality. For example, if we asked students to rate how attractive their teacher was, we would probably get a very “reliable” (consistent) measure from year to year and section to section. But that consistency would not make up for the fact that attractiveness is unlikely to help improve the prediction of effective teaching. So, the student survey has a high amount of consistency, but who knows what that is really measuring since it is only weakly related to student learning gains. It is consistent, but consistently wrong. Our focus should be on the predictive power of teacher evaluations and classrooms observations and student surveys don’t really do anything to help with that (at least, not according to the Gates study).
4) They may say that classroom observations and student surveys improve on the prediction of student effort and classroom environment.
Response: As I mentioned in the post over the weekend, they don’t really have validated measures of student effort and classroom environment. The Gates folks took a lot of flak last year for focusing on test-score gains, so they came up with some non-test score outcome measures simply by taking some of the items from the student survey where students are asked about their effort or classroom environment. We have no idea whether they have really measured the amount of effort students exert or the quality of the classroom environment; they are just using some survey answers on those items and claiming that they have measured those “outcomes.” The only validated outcome measures we have in the Gates study are the test score gains, so we have to focus on those.
—————————————————————————————————
The good news is that my fears about the Gates study being used to dictate what teachers do have not been realized, at least not yet. But it wasn’t for lack of trying. If the classroom observations had worked a little better in predicting student learning gains, I’m sure we would have heard about how teachers should run their classrooms to produce greater gains. But the classroom observations were so much of a dud that the Gates education chief, Vicki Phillips, didn’t even attempt to claim that they found that drill and kill is bad or that teachers should avoid teaching to the test.
But the inability to use the classroom observations to tell teachers the “right” way of teaching is another way of saying that the classroom observations cannot be used for diagnostic purposes. The most straightforward reading of the Gates results is that classroom observations appear to be an expensive and ineffective dud. But it’s hard for an organization that spends $45 million on a project to scientifically validate classroom observations to admit that it failed. It’s hard enough for a third-party evaluator to say that, let alone an in-house study about a key aspect of the Gates policy agenda.

The Gates Foundation has released the next installment of reports in their Measuring Effective Teachers Project. When the last report was released, I found myself in a tussle with the Gates folks and Sam Dillon at the New York Times because I noted that the study’s results didn’t actually support the finding attributed to it. Vicki Phillips, the education chief at Gates, told the NYT and LA Times that the study showed that “drill and kill” and “teaching to the test” hurt student achievement when the study actually found no such thing.
With the latest round of reports, the Gates folks are back to their old game of spinning their results to push policy recommendations that are actually unsupported by the data. The main message emphasized in the new round of reports is that we need multiple measures of teacher effectiveness, not just value-added measures derived from student test scores, to make reliable and valid predictions about how effective different teachers are at improving student learning.
This is the clear thrust of the newly released Policy and Practice Brief and Research Paper and is obviously what the reporters are being told by the Gates media people. For example, Education Week summarizes the report as follows:
…the study indicates that the gauges that appear to make the most finely grained distinctions of teacher performance are those that incorporate many different types of information, not those that are exclusively based on test scores.
And Ed Sector says:
The findings demonstrate the importance of multiple measures of teacher evaluation: combining observation scores, student achievement gains, and student feedback provided the most reliable and predictive assessment of a teacher’s effectiveness.
But buried away on p. 51 of the Research Paper in Table 16 we see that value-added measures based on student test results — by themselves — are essentially as good or better than the much more expensive and cumbersome method of combining them with student surveys and classroom observations when it comes to predicting the effectiveness of teachers. That is, the new Gates study actually finds that multiple measures are largely a waste of time and money when it comes to predicting the effectiveness of teachers at raising student scores in math and reading.
According to Table 16, student achievement gains correlate with the underlying value-added by teachers at .69. If the test scores are combined (with an equal weighting) with the results of a student survey and classroom observations that rate teachers according to a variety of commonly-used methods, the correlation to underlying value-added drops to between .57 and .61. That is, combining test scores with other measures where all measures are equally weighted actually reduces reliability.
The researchers also present the results of a criteria-weighted combination of student achievement gains, student surveys, and classroom observations based on the regression coefficients of how predictive each is of student learning growth in other sections for the same teacher. Based on this, test score gains are weighted at .729, the student survey at .179, and the classroom observations at .092. This tells us how much more predictive test score gains are than student surveys or classroom observations. Yet even when test score gains constitute 72.9% of the combined measure, the correlation to underlying teacher quality still ranges between .66 and .72, depending on which method is used for rating the classroom observations. The criteria-weighted combined measure provides basically no improvement in reliability over using test score gains by themselves.
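To make the weighting arithmetic concrete, here is a minimal sketch of how an equal-weighted versus criteria-weighted composite would be formed for a single teacher. The weights are the ones reported above (.729, .179, .092); the teacher's standardized component scores are hypothetical numbers invented for illustration, not figures from the Gates report.

```python
# Criteria weights from the report; component scores below are hypothetical.
weights = {"test_gains": 0.729, "student_survey": 0.179, "observations": 0.092}

def composite(scores, wts):
    """Weighted sum of standardized component scores for one teacher."""
    return sum(wts[k] * scores[k] for k in wts)

# Hypothetical standardized (z-score) measures for one teacher:
teacher = {"test_gains": 0.50, "student_survey": 0.10, "observations": -0.20}

equal = {k: 1 / 3 for k in teacher}  # equal weighting for comparison

print(round(composite(teacher, weights), 3))  # criteria-weighted: 0.364
print(round(composite(teacher, equal), 3))    # equal-weighted: 0.133
```

Because test score gains carry roughly 73% of the criteria weight, the composite tracks the test-score component closely, which is one way to see why the combined measure adds so little predictive power over test scores alone.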
And using multiple measures does not improve our ability to distinguish between effective and ineffective teachers. Using test scores alone, the difference between the top quartile and bottom quartile teacher in producing student value-added is .24 standard deviations in math learning growth on the state test. If we combine test scores with student surveys and classroom observations using an equal weighting, the difference between top and bottom quartile teachers shrinks to between .19 and .21. If we use the criteria weights, where test scores are 72.9% of the combined measure, the gap between top and bottom quartile teachers ranges between .22 and .25. In short, using multiple measures does not improve our ability to distinguish between effective and ineffective teachers.
The same basic pattern of results holds true for reading, which can be seen in Table 20 on p. 55 of the report. Combining test score measures of teacher effectiveness with student surveys and classroom observations does modestly improve our ability to predict how students would answer survey items about their effort in schools as well as how they felt about their classroom environment. But unlike test scores, which have been shown to be strong predictors of later life outcomes, I have no idea whether these survey items accurately capture what they intend or have any importance for students’ lives.
Adding the student surveys and classroom observation measures to test scores yields almost no benefits, but it adds an enormous amount of cost and effort to a system for measuring teacher effectiveness. To get the classroom observations to be usable, the Gates researchers had to have four independent observations of those classrooms by four separate people. If put into practice in schools that would consume an enormous amount of time and money. In addition, administering, scoring, and combining the student survey also has real costs.
So, why are the Gates folks saying that their research shows the benefits of multiple measures of teacher effectiveness when their research actually suggests virtually no benefits to combining other measures with test scores and when there are significant costs to adding those other measures? The simple answer is politics. Large numbers of educators and a segment of the population find relying solely on test scores for measuring teacher effectiveness to be unpalatable, but they might tolerate a system that combined test scores with classroom observations and other measures. Rather than using their research to explain that these common preferences for multiple measures are inconsistent with the evidence, the Gates folks want to appease this constituency so that they can put a formal system of systematically measuring teacher effectiveness in place. The research is being spun to serve a policy agenda.
This spinning of the findings is not just an accident or the results of a misunderstanding. It is clearly deliberate. Throughout the two reports Gates just released, they regularly engage in the same pattern of presenting the information. They show that the classroom observation measures by themselves have weak reliability and validity in predicting effective teachers. But if you add the student survey and then add the test score measures, you get much better measures of effective teachers. This pattern of presentation suggests the importance of multiple measures, since the classroom observations are strengthened when other measures are added. The only place you find the reliability and validity of test scores by themselves is at the bottom of the Research Paper in Tables 16 and 20. If both the lay-version and technical reports had always shown how little test scores are improved by adding student surveys and classroom observations, it would be plain that test scores alone are just about as good as multiple measures.
The Gates folks never actually inaccurately describe their results (as Vicki Phillips did with the previous report). But they are careful to frame the findings as consistently as possible with the Gates policy agenda of pushing a formal system of measuring teacher effectiveness that involves multiple measures. And it worked, since the reporters are repeating this inaccurate spin of their findings.
———————————————————————-
(UPDATE — For a post anticipating responses from Gates, see here.)
My friend and colleague, Marcus Winters, has a new book out on how to improve the quality of the teaching workforce. Teachers Matter is an excellent summary of the literature on how best to recruit, train, and motivate teachers. It’s a must-read for anyone interested in merit pay, credentialing, and teacher evaluation. It’s a particularly good book to assign for classes that cover these subjects. Check it out.
(Guest Post by Matthew Ladner)
The New York Times has a very nice feature on Clint and the GI litigation team. That scorpion may have to hunt and peck to type, but the sting packs a wallop!
Rick Hanushek interviews Terry Moe about his new book, Special Interest, which is the definitive new work on teacher unions and education.