Why is the man with the goatee smiling?

October 16, 2014

It might have something to do with this new report from MDRC showing a 9.4 percentage-point increase in graduation rates from NYC’s “small high schools” initiative. Students attending small high schools also enrolled in college at a rate 8.4 percentage points higher.

So just to review: the Gates Foundation had a winning strategy on its hands. It had a plausible theory but, at the time, not much empirical support. Sadly, they dropped this strategy rather than waiting for the empirical evaluations, which continue to pile up and show strongly positive results. The siren call of central planning lured them into an endless quagmire that lacks empirical support (see Hanushek and Loveless) and also lacks a plausible theory of change. Small schools now lacks neither of these things.

There’s one obvious solution to all of this: he’s tan, rested, and ready, and he’s bringing back socks and sandals! Or perhaps it would be more accurate to say that he is bringing in socks and sandals for the first time. Regardless: bring back Tom Vander Ark!



What Success Would Have Looked Like

January 10, 2013

Yesterday I described the Gates Foundation’s Measuring Effective Teachers (MET) project as “an expensive flop.”  To grasp just what a flop the project was, it’s important to consider what success would have looked like.  If the project had produced what Gates was hoping, it would have found that classroom observations were strong, independent predictors of other measures of effective teaching, like student test score gains.  Even better, they were hoping that the combination of classroom observations, student surveys, and previous test score gains would be a much better predictor of future test score gains (or of future classroom observations) than any one of those measures alone.  Unfortunately, MET failed to find anything like this.

If MET had found classroom observations to be strong predictors of other indicators of effective teaching and if the combination of measures were a significantly better predictor than any one measure alone, then Gates could have offered evidence for the merits of a particular mixing formula or range of mixing formulas for evaluating teachers.  That evidence could have been used to good effect to shape teacher evaluation systems in Chicago, LA, and everywhere else.

They also could have genuinely reassured teachers anxious about the use of test score gains in teacher evaluations.  MET could have allayed those concerns by telling teachers that test score gains produce information that is generally similar to what is learned from well-conducted classroom observations, so there is no reason to oppose one and support the other.  What’s more, significantly improved predictive power from a mixture of classroom observations with test score gains could have made the case for why we need both.

MET was also supposed to have helped us adjudicate among several commonly used rubrics for classroom observations so that we would have solid evidence for preferring one approach over another.  Because MET found that classroom observations in general are barely related to other indicators of teacher effectiveness, the study told us almost nothing about the criteria we should use in classroom observations.

In addition, the classroom observation study was supposed to help us identify the essential components of effective teaching.  That knowledge could have informed improved teacher training and professional development.  But because MET was a flop (because classroom observations barely correlate with other indicators of teacher effectiveness and fail to improve the predictive power of a combined measure), we haven’t learned much of anything about the practices that are associated with effective teaching.  If we can’t connect classroom observations with effective teaching in general, we certainly can’t say much about which particular aspects of observed teaching contributed most to effective teaching.

Just so you know that I’m not falsely attributing to MET these goals that failed to be realized, look at this 2011 Wall Street Journal interview of Bill Gates by Jason Riley.  You’ll clearly see that Bill Gates was hoping that MET would do what I described above.  It failed to do so.  Here is what the interview revealed about the goals of MET:

Of late, the foundation has been working on a personnel system that can reliably measure teacher effectiveness. Teachers have long been shown to influence students’ education more than any other school factor, including class size and per-pupil spending. So the objective is to determine scientifically what a good instructor does.

“We all know that there are these exemplars who can take the toughest students, and they’ll teach them two-and-a-half years of math in a single year,” he says. “Well, I’m enough of a scientist to want to say, ‘What is it about a great teacher? Is it their ability to calm down the classroom or to make the subject interesting? Do they give good problems and understand confusion? Are they good with kids who are behind? Are they good with kids who are ahead?’

“I watched the movies. I saw ‘To Sir, With Love,'” he chuckles, recounting the 1967 classic in which Sidney Poitier plays an idealistic teacher who wins over students at a roughhouse London school. “But they didn’t really explain what he was doing right. I can’t create a personnel system where I say, ‘Go watch this movie and be like him.'”

Instead, the Gates Foundation’s five-year, $335-million project examines whether aspects of effective teaching—classroom management, clear objectives, diagnosing and correcting common student errors—can be systematically measured. The effort involves collecting and studying videos of more than 13,000 lessons taught by 3,000 elementary school teachers in seven urban school districts.

“We’re taking these tapes and we’re looking at how quickly a class gets focused on the subject, how engaged the kids are, who’s wiggling their feet, who’s looking away,” says Mr. Gates. The researchers are also asking students what works in the classroom and trying to determine the usefulness of their feedback.

Mr. Gates hopes that the project earns buy-in from teachers, which he describes as key to long-term reform. “Our dream is that in the sample districts, a high percentage of the teachers determine that this made them better at their jobs.” He’s aware, though, that he’ll have a tough sell with teachers unions, which give lip service to more-stringent teacher evaluations but prefer existing pay and promotion schemes based on seniority—even though they often end up matching the least experienced teachers with the most challenging students.

The final MET reports produced virtually nothing that addressed these stated goals.  But in Orwellian fashion, the Gates folks have declared the project to be a great success.  I never expected MET to work because I suspect that effective teaching is too heterogeneous to be captured well by a single formula.  There is no recipe for effective teaching because kids and their needs are too varied, teachers and their abilities are too varied, and the proper matching of student needs and teacher abilities can be accomplished in many different ways.  But this is just my suspicion.  I can’t blame the Gates Foundation for trying to discover the secret sauce of effective teaching, but I can blame them for refusing to admit that they failed to find it.  Even worse, I blame them for distorting, exaggerating, and spinning what they did find.

(edited for typos)

Understanding the Gates Foundation’s Measuring Effective Teachers Project

January 9, 2013

If I were running a school I’d probably want to evaluate teachers using a mixture of student test score gains, classroom observations, and feedback from parents, students, and other staff.  But I recognize that different schools have different missions and styles that can best be assessed using different methods.  I wouldn’t want to impose on all schools in a state or the nation a single, mechanistic system for evaluating teachers since that is likely to be a one size fits none solution.  There is no single best way to evaluate teachers, just like there is no single best way to educate students.

But the folks at the Gates Foundation, afflicted with PLDD, don’t see things this way.  They’ve been working with politicians in Illinois, Los Angeles, and elsewhere to centrally impose teacher evaluation systems, but they’ve encountered stiff resistance.  In particular, they’ve noticed that teachers and others have expressed strong reservations about any evaluation system that relies too heavily on student test scores.

So the folks at Gates have been trying to scientifically validate a teacher evaluation system that involves a mix of test score gains, classroom observations, and student surveys so that they can overcome resistance to centrally imposed, mechanistic evaluation systems.  If they can reduce reliance on test scores in that system while still carrying the endorsement of “science,” the Gates folks imagine that politicians, educators, and others will all embrace the Gates central planning fantasy.

Let’s leave aside for the moment the political reality, demonstrated recently in Chicago and Los Angeles, that teachers are likely to fiercely resist any centrally imposed, mechanistic evaluation system regardless of the extent to which it relies on test scores.  The Gates folks want to put on their lab coats and throw the authority of science behind a particular approach to teacher evaluation.  If you oppose it you might as well deny global warming.  Science has spoken.

So it is no accident that the release of the third and final round of reports from the Gates Foundation’s Measuring Effective Teachers project was greeted with the following headline in the Washington Post: “Gates Foundation study: We’ve figured out what makes a good teacher,”  or this similarly humble claim in the Denver Post: “Denver schools, Gates foundation identify what makes effective teacher.”  This is the reaction that the Gates Foundation was going for — we’ve used science to discover the correct formula for evaluating teachers.  And by implication, we now know how to train and improve teachers by using the scientifically validated methods of teaching.

The only problem is that things didn’t work out as the Gates folks had planned.  Classroom observations make virtually no independent contribution to the predictive power of a teacher evaluation system.  You have to dig to find this, but it’s right there in Table 1 on page 10 of one of the technical reports released yesterday.  In a regression to predict student test score gains using out-of-sample test score gains for the same teacher, student survey results, and classroom observations, there is virtually no relationship between test score gains and either classroom observations or student survey results.  In only 3 of the 8 models presented is there any statistically significant relationship between either classroom observations or student surveys and test score gains (I’m excluding the 2 instances where they report p < .1 as statistically significant).  And in all 8 models the point estimates suggest that a standard deviation improvement in classroom observation or student survey results is associated with less than a .1 standard deviation increase in test score gains.

Not surprisingly, a composite teacher evaluation measure that mixes classroom observations and student survey results with test score gains is generally no better and sometimes much worse at predicting out-of-sample test score gains.  The Gates folks trumpet the finding that the combined measures are more “reliable,” but that only means they are more consistent, not any more predictive.

But “the best mix,” according to the “policy and practitioner brief,” is “a composite with weights between 33 percent and 50 percent assigned to state test scores.”  How do they know this is the “best mix”?  It generally isn’t any better at predicting test score gains.  And collecting the classroom observations involves enormous expense and hassle.  To get the measure as “reliable” as they did without sacrificing too much predictive power, the Gates team had to observe each teacher at least four different times, using at least two different coders, including one coder from outside the school.  Observing 3.2 million public school teachers for four hours each, with staff compensated at $40 per hour, would cost more than $500 million each year.  The Gates people also had to train the observers for at least 17 hours, and even after that had to throw out almost a quarter of those observers as unreliable.  Doing all of this might cost about $1 billion each year.
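The back-of-envelope arithmetic above is easy to verify.  The teacher count, hours of observation, and hourly wage are the figures stated in the paragraph; everything else simply follows from them:

```python
# Back-of-envelope cost of mechanistic classroom observation,
# using the assumptions stated in the post.
teachers = 3_200_000        # public school teachers nationwide
hours_per_teacher = 4       # at least four observations, roughly an hour each
wage = 40                   # dollars per hour for observer staff

observation_cost = teachers * hours_per_teacher * wage
print(f"Observation time alone: ${observation_cost:,} per year")
# → Observation time alone: $512,000,000 per year

# Observer training (17+ hours each), second coders, and replacing the
# roughly 25% of observers who fail reliability checks are what plausibly
# push the total toward the $1 billion/year figure.
```

Note that the $512 million covers only the observation hours themselves, which is why the all-in estimate roughly doubles it.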

And what would we get for this billion?  Well, we might get more consistent teacher evaluation scores, but we’d get basically no improvement in the identification of effective teachers.  And that’s the “best mix”?  Best for what?  It’s best for the political packaging of a centrally imposed, mechanistic teacher evaluation system, which is what this is all really about.  Vicki Phillips, who heads the Gates education efforts, captured in this comment what I think they are really going for with a composite evaluation score:

Combining all three measures into a properly weighted index, however, produced a result “teachers can trust,” said Vicki Phillips, a director in the education program at the Gates Foundation.

It’ll cost a fortune, it doesn’t improve the identification of effective teachers, but we need to do it to overcome resistance from teachers and others.  Not only will this not work, but in spinning the research as they have, the Gates Foundation is clearly distorting the straightforward interpretation of their findings: a mechanistic system of classroom observation provides virtually nothing for its enormous cost and hassle.  Oh, and this is the case when no stakes were attached to the classroom observations.  Once we attach all of this to pay or continued employment, their classroom observation system will only get worse.

I should add that if classroom observations aren’t useful as predictors, they also can’t be used effectively for diagnostic purposes.  An earlier promise of this project was that it would figure out which teacher evaluation rubrics were best and which sub-components of those rubrics were most predictive of effective teaching.  But that clearly hasn’t panned out.  In the new reports I can’t find anything about the diagnostic potential of classroom observations, which is not surprising since those observations are not predictive.

So, rather than having “figured out what makes a good teacher” the Gates Foundation has learned very little in this project about effective teaching practices.  The project was an expensive flop.  Let’s not compound the error by adopting this expensive flop as the basis for centrally imposed, mechanistic teacher evaluation systems nationwide.

(Edited for typos and to add links.  To see a follow-up post, click here.)

Gates Gets Groovy, Invests in Mood Rings

June 19, 2012

Building on their earlier $1.4 million investment in bracelets to measure skin conductivity (sweating) as a proxy for student engagement, the Gates Foundation has decided to embark on a multi-million dollar investment in mood rings.

As you can see from their research results pictured above, the mood ring is capable of identifying a variety of student emotional states that could affect the learning environment.  Teachers need to be particularly wary of the “hungry for waffles” mood because it is sometimes followed by the “flatulence” or “full bladder” mood.

Besides, mood rings are pretty groovy.  And they can’t be any dumber than these Q Sensor bracelets.

Gates Goes Wild

June 19, 2012

Gates researchers using science to enhance student learning

Even a blind squirrel occasionally finds an acorn.  Well, Diane Ravitch, Susan Ohanion, Leonie Haimson, and their tinfoil hat crew have stumbled upon some of the craziest stuff I’ve ever heard in ed reform.  It appears the Gates Foundation has spent more than $1 million to develop Galvanic Skin Response bracelets to gauge student response to instruction as part of their Measuring Effective Teachers project.  The Galvanic Skin Response measures the electrical conductance of the skin, which varies largely due to the moisture from people’s sweat.

Stephanie Simon, a Reuters reporter, summarizes the Gates effort:

The foundation has given $1.4 million in grants to several university researchers to begin testing the devices in middle-school classrooms this fall.

The biometric bracelets, produced by a Massachusetts startup company, Affectiva Inc, send a small current across the skin and then measure subtle changes in electrical charges as the sympathetic nervous system responds to stimuli. The wireless devices have been used in pilot tests to gauge consumers’ emotional response to advertising.

Gates officials hope the devices, known as Q Sensors, can become a common classroom tool, enabling teachers to see, in real time, which kids are tuned in and which are zoned out.

Um, OK.  We’ve already written about how unreliable the Gates Foundation is in describing their own research, here and here.  And we’ve already written about how the entire project of using science to discover the best way to teach is a fool’s enterprise.

And now the Gates Foundation is extending that foolish enterprise to include measuring Galvanic Skin Response as a proxy for student engagement.  This simply will not work.  The extent to which students sweat is not a proxy for engagement or for learning.  It is probably a better proxy for whether they are seated near the heater or next to a really pretty girl (or handsome boy).

Galvanic Skin Response has already been widely used as part of the “scientific” effort to detect lying.  And as any person who actually cares about science knows — lie detectors do not work.  Sweating is no more a sign of lying than it is of student engagement.

I’m worried that the Gates Foundation is turning into a Big Bucket of Crazy.  Anyone who works for Gates should be worried about this.  Anyone who is funded by Gates should be worried about this.  If people don’t stand up and tell Gates that they are off the rails, the reputation of everyone associated with Gates will be tainted.

Gates, the Bizarro Foundation

January 31, 2012

Comic book geeks are familiar with Bizarro World, a place where everything is the opposite of what it is in the normal world.  In Bizarro World, people would abandon a policy strongly supported by rigorous evidence while embracing an alternative policy for which the evidence showed little promise.

I was thinking about Bizarro World and then it struck me — Perhaps the Gates Foundation has somehow fallen into the Bizarro World.  It’s just about the only thing that makes sense of their Bizarro choices with respect to education reform strategies.

The dominant education reform strategy of the Gates Foundation before 2006 was to break large high schools into smaller ones, often using school choice and charter schools.  As a Business Week profile put it:

The foundation embraced what many social scientists had concluded was the prime solution: Instead of losing kids in large schools like Manual, the new thinking was to divide them into smaller programs with 200 to 600 students each. Doing so, numerous studies showed, would help prevent even hard-to-reach students from falling through the cracks. The foundation didn’t set out to design schools or run them. Its goal was to back some creative experiments and replicate them nationally.

But the Gates Foundation wasn’t patient enough to let the experiments produce results.  Instead, they hired SRI and AIR to do a weakly designed, non-experimental evaluation that produced disappointing results.  Gates had also commissioned a rigorous random-assignment evaluation by MDRC, but it would take a few more years to see whether students assigned by lottery to a smaller school graduated and went on to college at higher rates.

Gates couldn’t wait.  They were convinced that small schools were a flop, so they began to ditch the small school strategy and look for a new Big Idea.  Tom Vander Ark, the education chief who had championed small schools, was out the door and replaced with Vicki Phillips, a superintendent whose claim to fame, such as it was, came from serving as Portland’s superintendent where she consolidated schools (not breaking them into smaller ones) and centralized control over curriculum and instruction.  As one local observer put it:

In her time in the famously progressive, consensus-driven city, she closed six schools, merged nearly two dozen others through K-8 conversions, pushed to standardize the district’s curriculum, and championed new and controversial measures for testing the district’s 46,000 children-all mostly without stopping for long enough to adequately address the concerns her changes generated in the neighborhoods and schools where they played out.  During her three years in Portland, Phillips’ name became synonymous with top-down management, corporate-style reforms, and a my-way-or-the-highway attitude.

Under Phillips and deputy education director Tom Kane, a Harvard professor, the Gates Foundation has pursued a very different strategy: attempting to identify the best standards, curriculum, and pedagogy and then imposing those best practices through a national system of standards and testing.

And here is where we see that Gates must be the Bizarro Foundation.  The previous strategy of backing small schools has now been vindicated by the rigorous random-assignment study Gates couldn’t wait for.  According to the New York Times:

The latest findings show that 67.9 percent of the students who entered small high schools in 2005 and 2006 graduated four years later, compared with 59.3 percent of the students who were not admitted and instead went to larger schools. The higher graduation rate at small schools held across the board for all students, regardless of race, family income or scores on the state’s eighth-grade math and reading tests, according to the data.

This increase was almost entirely accounted for by a rise in Regents diplomas, which are considered more rigorous than a local diploma; 41.5 percent of the students at small schools received one, compared with 34.9 percent of students at other schools. There was little difference between the two groups in the percentage of students who earned a local diploma or the still more rigorous Advanced Regents diploma.

Small-school students also showed more evidence of college readiness, with 37.3 percent of the students earning a score of 75 or higher on the English Regents, compared with 29.7 percent of students at other schools. There was no significant difference, however, in scores on the math Regents.

Meanwhile, as part of their newly embraced top-down strategy, the Gates effort to identify the secret formula for effective teaching has failed to bear fruit.  The Gates-operated Measuring Effective Teachers Project failed to identify any rubric for observing teachers, or any components of those rubrics, that was strongly predictive of gains in student learning.  And the Gates-backed “research” supporting the federally-orchestrated Common Core push for national standards and testing has been strikingly lacking in scientific rigor and candor.

In short, the Gates Foundation has ditched what rigorous evidence shows worked and is pushing a new strategy completely unsupported by rigorous evidence.  They must be in Bizarro World.  Somebody please get me some blue kryptonite.

Anticipating Responses from Gates

January 9, 2012

Over the weekend I posted about how I thought the Gates Foundation was spinning the results of their Measuring Effective Teachers Project to suggest that the combination of student achievement gains, student surveys, and classroom observations was the best way to have a predictive measure of teacher effectiveness.  Let me anticipate some of the responses they may have:

1) They might say that they clearly admit the limitations of classroom observations and therefore are not guilty of spinning the results to inflate their importance.  They could point to p. 15 of the research paper in which they write: “When value-added data are available, classroom observations add little to the ability to predict value-added gains with other groups of students. Moreover, classroom observations are less reliable than student feedback, unless many different observations are added together.”

Response: I said in my post over the weekend that the Gates folks were careful so that nothing in the reports is technically incorrect.  The distortion of their findings comes from the emphasis and manner of presentation.  For example, the summary of findings in the research paper on p. 9 states: “Combining observation scores with evidence of student achievement gains and student feedback improved predictive power and reliability.”  Or the “key findings” in the practitioner brief on p. 5 say: “Observations alone, even when scores from multiple observations were averaged together, were not as reliable or predictive of a teacher’s student achievement gains with another group of students as a measure that combined observations with student feedback and achievement gains on state tests.”  Notice that these summaries of the results fail to mention the most straightforward and obvious finding: classroom observations are really expensive and cumbersome and yet do almost nothing to improve the predictiveness of student achievement-based measures of teacher quality.

And the proof that the results are being spun is that the media coverage uniformly repeats the incorrect claim that multiple measures are an important improvement on test scores alone.  Either all of the reporters are lousy and don’t understand the reports or the reporters are accurately repeating what they are being told and what they overwhelmingly see in the reports.  My money is on the latter explanation.

And further proof that the reporters are being spun is that Vicki Phillips, the Gates education chief, is quoted in the LA Times coverage mischaracterizing the findings: “Using these methods to evaluate teachers is ‘more predictive and powerful in combination than anything we have used as a proxy in the past,’ said Vicki Phillips, who directs the Gates project.”  This is just wrong.  As I pointed out in my previous post, the combined measure is no more predictive than student achievement by itself.

Lastly, the standard for fair and accurate reporting of results is not whether one could find any way to show that technically the description of findings is not false.  We should expect the most straightforward and obvious description of findings emphasized.  With the Gates folks I feel like I am repeatedly parsing what the meaning of the word “is” is.  That’s political spin, not research.

2) They might say that classroom observations are an important addition because at least they provide diagnostic information about how teachers can improve, while test scores cannot.

Response:  This may be true, but it is not a claim supported by the Gates study.  They found that all of the different classroom observation methods they tried had very weak predictive power.  You can’t provide a lot of feedback about how to improve student achievement based on instruments that are barely correlated with gains in student achievement.  In addition, they were unable to find sub-components of the classroom observation methods that were more predictive, so they can’t tell teachers to emphasize certain practices on the grounds that those practices are more strongly related to student learning gains.  Lastly, it is simply untrue that test scores cannot be diagnostic.  There are sub-components of the tests that measure learning in different aspects of the subject.  Teachers could be told to put more emphasis on the areas where their students have lagged.

3) They may say that classroom observations and student surveys improve the reliability of a teacher quality measure when combined with test scores.

Response: An increase in reliability is cold comfort for a lack of predictive power.  Reliability is just an indicator of how consistent a measure is.  There are plenty of measures that are very consistent but not helpful in predicting teacher quality.  For example, if we asked students to rate how attractive their teacher was, we would probably get a very “reliable” (consistent) measure from year to year and section to section.  But that consistency would not make up for the fact that attractiveness is unlikely to help improve the prediction of effective teaching.  So, the student survey has a high amount of consistency, but who knows what it is really measuring, since it is only weakly related to student learning gains.  It is consistent, but consistently wrong.  Our focus should be on the predictive power of teacher evaluations, and classroom observations and student surveys don’t really do anything to help with that (at least, not according to the Gates study).
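The attractiveness example can be made concrete with a small simulation.  This is a sketch under made-up numbers, not anything from the MET data: a rating driven by a stable trait that is unrelated to teaching quality looks highly “reliable” year over year while predicting nothing about achievement gains.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical teachers

true_quality = rng.normal(size=n)   # what actually drives achievement gains
stable_trait = rng.normal(size=n)   # e.g., attractiveness; independent of quality

# Two years of student ratings of the trait: nearly identical each year
rating_y1 = stable_trait + 0.1 * rng.normal(size=n)
rating_y2 = stable_trait + 0.1 * rng.normal(size=n)

# Test-score gains depend on true quality plus noise
gains = true_quality + rng.normal(size=n)

reliability = np.corrcoef(rating_y1, rating_y2)[0, 1]  # consistency of the measure
validity = np.corrcoef(rating_y1, gains)[0, 1]         # its predictive power

print(f"reliability: {reliability:.2f}, predictive power: {validity:.2f}")
```

With these settings the year-to-year correlation comes out near 1 while the correlation with gains hovers near 0: consistent, but consistently uninformative, which is exactly the distinction the paragraph above draws.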

4) They may say that classroom observations and student surveys improve the prediction of student effort and classroom environment.

Response: As I mentioned in the post over the weekend, they don’t really have validated measures of student effort and classroom environment.  The Gates folks took a lot of flak last year for focusing on test-score gains, so they came up with some non-test-score outcome measures simply by taking some of the items from the student survey where students are asked about their effort or classroom environment.  We have no idea whether they have really measured the amount of effort students exert or the quality of the classroom environment; they are just using some survey answers on those items and claiming that they have measured those “outcomes.”  The only validated outcome measures we have in the Gates study are the test score gains, so we have to focus on those.


The good news is that my fears about the Gates study being used to dictate what teachers do have not been realized, at least not yet.  But it wasn’t for lack of trying.  If the classroom observations had worked a little better in predicting student learning gains, I’m sure we would have heard about how teachers should run their classrooms to produce greater gains.  But the classroom observations were so much of a dud that Gates education chief Vicki Phillips didn’t even attempt to claim that they found that drill and kill is bad or that teachers should avoid teaching to the test.

But the inability to use the classroom observations to tell teachers the “right” way of teaching is another way of saying that the classroom observations cannot be used for diagnostic purposes.  The most straightforward reading of the Gates results is that classroom observations are an expensive and ineffective dud.  But it’s hard for an organization that spends $45 million on a project to scientifically validate classroom observations to admit that it failed.  It’s hard enough for a third-party evaluator to say that, let alone an in-house study about a key aspect of the Gates policy agenda.

How the Gates Foundation Spins its Research

January 7, 2012

The Gates Foundation has released the next installment of reports in their Measuring Effective Teachers Project.  When the last report was released, I found myself in a tussle with the Gates folks and Sam Dillon at the New York Times because I noted that the study’s results didn’t actually support the finding attributed to it.  Vicki Phillips, the education chief at Gates, told the NYT and LA Times that the study showed that “drill and kill” and “teaching to the test” hurt student achievement when the study actually found no such thing.

With the latest round of reports, the Gates folks are back to their old game of spinning their results to push policy recommendations that are actually unsupported by the data.  The main message emphasized in the new round of reports is that we need multiple measures of teacher effectiveness, not just value-added measures derived from student test scores, to make reliable and valid predictions about how effective different teachers are at improving student learning.

This is the clear thrust of the newly released Policy and Practice Brief and Research Paper and is obviously what the reporters are being told by the Gates media people.  For example, Education Week summarizes the report as follows:

…the study indicates that the gauges that appear to make the most finely grained distinctions of teacher performance are those that incorporate many different types of information, not those that are exclusively based on test scores.

And Ed Sector says:

The findings demonstrate the importance of multiple measures of teacher evaluation: combining observation scores, student achievement gains, and student feedback provided the most reliable and predictive assessment of a teacher’s effectiveness.

But buried away on p. 51 of the Research Paper, in Table 16, we see that value-added measures based on student test results, by themselves, are essentially as good as or better than the much more expensive and cumbersome method of combining them with student surveys and classroom observations when it comes to predicting the effectiveness of teachers.  That is, the new Gates study actually finds that multiple measures are largely a waste of time and money for predicting how effective teachers are at raising student scores in math and reading.

According to Table 16, student achievement gains correlate with teachers’ underlying value-added at .69. If the test scores are combined (with equal weighting) with the results of a student survey and classroom observations that rate teachers according to a variety of commonly used methods, the correlation with underlying value-added drops to between .57 and .61.  That is, combining test scores with other measures, with all measures equally weighted, actually reduces reliability.

The researchers also present the results of a criteria-weighted combination of student achievement gains, student surveys, and classroom observations, with weights based on regression coefficients for how well each predicts student learning growth in other class sections taught by the same teacher.  On this basis, test score gains are weighted at .729, the student survey at .179, and the classroom observations at .092, which tells us how much more predictive test score gains are than student surveys or classroom observations.  Yet even when test score gains constitute 72.9% of the combined measure, the correlation with underlying teacher quality still ranges between .66 and .72, depending on which method is used for rating the classroom observations.  The criteria-weighted combined measure provides basically no improvement in reliability over using test score gains by themselves.
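The pattern in these tables reflects a standard result about combining noisy measures: averaging a strong indicator with weaker ones dilutes it, while regression-based weights roughly preserve it. A minimal simulation illustrates the logic. The noise levels below are my own illustrative guesses, chosen only so the single-measure correlations land near the magnitudes reported in the post; this is a sketch, not the MET data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Latent "true" teacher value-added (standardized).
true_va = rng.normal(0, 1, n)

# Three noisy measures of the same latent quality.  The noise standard
# deviations are hypothetical, picked so the single-measure correlations
# roughly echo the post: a strong test-score measure and two weaker ones.
test   = true_va + rng.normal(0, 1.02, n)   # corr with true_va ~ .70
survey = true_va + rng.normal(0, 2.2,  n)   # corr with true_va ~ .41
obs    = true_va + rng.normal(0, 3.3,  n)   # corr with true_va ~ .29

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# Equal weighting vs. the criteria weights quoted in the post.
equal    = (test + survey + obs) / 3
weighted = 0.729 * test + 0.179 * survey + 0.092 * obs

print(corr(test, true_va))      # strong measure alone
print(corr(equal, true_va))     # equal weights: noticeably lower
print(corr(weighted, true_va))  # criteria weights: about the same as test alone
```

The equal-weighted composite does worse because the composite’s noise variance is the squared-weight sum of each component’s noise variance, so putting a third of the weight on the noisiest measures drags reliability down; the criteria weights load almost everything on the strong measure, so little is gained or lost.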

Nor does using multiple measures improve our ability to distinguish between effective and ineffective teachers.  Using test scores alone, the difference between top-quartile and bottom-quartile teachers in student value-added is .24 standard deviations of math learning growth on the state test.  If we combine test scores with student surveys and classroom observations using equal weights, the difference between top- and bottom-quartile teachers shrinks to between .19 and .21.  If we use the criteria weights, where test scores are 72.9% of the combined measure, the gap ranges between .22 and .25.  In short, multiple measures do not sharpen the distinction between effective and ineffective teachers.

The same basic pattern holds for reading, as can be seen in Table 20 on p. 55 of the report.  Combining test score measures of teacher effectiveness with student surveys and classroom observations does slightly improve our ability to predict how students would answer survey items about their effort in school and how they felt about their classroom environment.  But unlike test scores, which have been shown to be strong predictors of later life outcomes, we have no idea whether these survey items accurately capture what they intend to measure or have any importance for students’ lives.

Adding the student survey and classroom observation measures to test scores yields almost no benefit, but it adds an enormous amount of cost and effort to a system for measuring teacher effectiveness.  To get the classroom observations to be usable, the Gates researchers needed four independent observations of each classroom by four separate people.  Put into practice in schools, that would consume an enormous amount of time and money.  In addition, administering, scoring, and combining the student survey also has real costs.

So why are the Gates folks saying that their research shows the benefits of multiple measures of teacher effectiveness when it actually suggests virtually no benefit to combining other measures with test scores, and when adding those other measures carries significant costs?  The simple answer is politics.  Large numbers of educators, and a segment of the public, find relying solely on test scores to measure teacher effectiveness unpalatable, but they might tolerate a system that combined test scores with classroom observations and other measures.  Rather than using their research to explain that this common preference for multiple measures is inconsistent with the evidence, the Gates folks want to appease this constituency so that they can put a formal system for measuring teacher effectiveness in place.  The research is being spun to serve a policy agenda.

This spinning of the findings is not just an accident or the result of a misunderstanding.  It is clearly deliberate.  Throughout the two reports Gates just released, they repeatedly follow the same pattern of presentation: show that the classroom observation measures by themselves have weak reliability and validity in predicting effective teachers, then show that adding the student survey, and then the test score measures, yields much better measures of effective teachers.  This order of presentation suggests the importance of multiple measures, since the classroom observations are strengthened when other measures are added.  The only place you find the reliability and validity of test scores by themselves is at the bottom of the Research Paper, in Tables 16 and 20.  If both the lay version and the technical reports had always shown how little test scores are improved by adding student surveys and classroom observations, it would be plain that test scores alone are just about as good as multiple measures.

The Gates folks never outright misstate their results (as Vicki Phillips did with the previous report).  But they are careful to frame the findings as consistently as possible with the Gates policy agenda of pushing a formal system of measuring teacher effectiveness that involves multiple measures.  And it worked, since reporters are repeating this inaccurate spin on the findings.


(UPDATE — For a post anticipating responses from Gates, see here.)

Gates Responds

October 26, 2011

Steve Cantrell, a senior researcher at Gates, sent me an email last night in response to my post from yesterday asking for the MET results to be released.  He said I was right in suggesting that large, complicated projects sometimes take longer than originally planned.  He said that final scores for coding the videos had just been delivered to the research team and that the full results for the 2009-10 year were now scheduled for release on January 5, 2012.  It’s unclear whether that report will also contain information for the 2010-11 year.  The MET web site will be changed to reflect this new schedule.  (Update: According to another email from Steve Cantrell, the January release will only contain the full 2009-10 results.  The final results, including 2010-11, are scheduled for release in early summer 2012.)

Steve also clarified information on the cost of the project.  Last year I repeated the New York Times and LA Times description of the project costing $45 million.  More recently I’ve repeated the Wall Street Journal description of the project cost as $335 million.  Steve resolved the confusion by saying that the MET study costs about $50 million and the $335 million figure includes grants to the partner districts.

Let me be clear that I think Gates has a lot of good and smart people working on the MET project.  My concern is not that these are bad people.  My concern is that Gates has a flawed strategy based on centrally identifying what educators should do and then building a system of standards, curriculum, and assessments to impose those practices on the education system.  I don’t think this kind of centralized approach can work and I fear that it creates enormous pressure on good and smart researchers to toe the centralized line — even if it becomes obvious that it is not working.  Everyone at Gates can see what happened to the folks who pushed small schools when the Foundation decided that approach was not working.

And unlike Diane Ravitch, Valerie Strauss, and the Army of Angry Teachers, I am not criticizing the Gates Foundation because I think Bill Gates is in the “billionaire boys club” and therefore somehow disqualified from using his wealth to try to improve education.  I am critical of recent Gates Foundation efforts because I believe Gates can and should try to improve education by adopting a more fruitful strategy.

(corrected typos)

Gates Foundation — Release the MET Results

October 25, 2011

A sketch of the $500 million new Gates Foundation headquarters

Bill and Melinda Gates mentioned again in the Wall Street Journal the Measuring Effective Teachers (MET) project that their foundation is orchestrating.  Bill and Melinda may want to check on the status of the MET research they’ve been touting since full results were promised in the spring of 2011 and have yet to be released.

Just to review… In an earlier interview with the Journal, MET was described as follows:

the Gates Foundation’s five-year, $335-million project examines whether aspects of effective teaching (classroom management, clear objectives, diagnosing and correcting common student errors) can be systematically measured. The effort involves collecting and studying videos of more than 13,000 lessons taught by 3,000 elementary school teachers in seven urban school districts.

The motivation, reiterated in the new piece by Bill and Melinda Gates, is to identify what “works” in classroom teaching in order to develop systems that train and encourage other teachers to imitate those practices:

It may surprise you—it was certainly surprising to us—but the field of education doesn’t know very much at all about effective teaching. We have all known terrific teachers. You watch them at work for 10 minutes and you can tell how thoroughly they’ve mastered the craft. But nobody has been able to identify what, precisely, makes them so outstanding….

The intermediate goal of MET is to discover what we are able to measure that is predictive of student success. The end goal is to have a better sense of what makes teaching work so that school districts can start to hire, train and promote based on meaningful standards.

As I’ve argued before, using research to identify “best practices” in teaching only makes sense if the same teaching approaches would be desirable for the vast majority of teachers and students, regardless of the context.  And as I’ve also suggested before, I don’t believe this effort is likely to yield much in education.  Effective teaching is like effective parenting — it is highly dependent on the circumstances.  Yes, there are some parenting (and teaching) techniques that are generally effective for almost everyone, but those are mostly known and already in use.

This doesn’t mean we are completely unable to measure effective teaching (or parenting).  It just means that we have to judge it by the results and cannot easily make universal statements about the right methods for producing those results.  To make a sports analogy, there is no single “best practice” for hitters in baseball.  There are a variety of stances and swings.  The best way to judge an effective hitter is by the results, not by the stance or swing.  And if we tried to make all hitters stand and swing in the same way, we’d make a lot of them worse hitters.

It is because of this heterogeneity in effective teaching practices that I think the MET project is doomed to disappoint.  And I’ve heard from inside sources that the results are being delayed because they are failing to produce much of anything.

According to the MET web site, the full results for the 1st year should have been released in the spring:

In spring 2011, the project will release full results from the first year of the study, including predictors of teaching effectiveness and correlation with value-added assessments.

It is almost November and we have not seen these results.  I understand that in very large and complicated projects, like MET, things can take much longer than originally planned.  If so, it would be nice to hear that explanation.  It would be even nicer if the Gates Foundation released results if they have them, even if those results were not what they had hoped they would find.

Some inquisitive reporters should start asking Gates officials and members of the research team about the status of the MET results.  Reporters should go beyond talking to the media flacks at Gates HQ and actually talk to individual members of the team confidentially.  If they do that, they may confirm what I have been hearing: MET results have been delayed because they aren’t panning out.

(UPDATE: Gates responds.)
