Why is the man with the goatee smiling?

October 16, 2014

It might have something to do with this new report from MDRC showing a 9.4% increase in graduation rates in NYC in the “small high-schools” initiative. Students attending small high schools attended college at an 8.4% higher rate as well.

So just to review, Gates FF had a winning strategy on their hands- it had a plausible theory but not much empirical support. Sadly, they dropped this strategy without waiting for the empirical evaluations, which continue to pile up with strongly positive results. The siren call of central planning lured them into an endless quagmire that also lacks empirical support (see Hanushek and Loveless) and also lacks a plausible theory of change. Small schools now lack neither of these things.

There’s one obvious solution to all of this- he’s tan, rested and ready and he’s bringing back socks and sandals! Or perhaps it would be more accurate to say that he is bringing in socks and sandals for the first time. Regardless- bring back Tom Vander Ark!



What Success Would Have Looked Like

January 10, 2013

Yesterday I described the Gates Foundation’s Measuring Effective Teachers (MET) project as “an expensive flop.”  To grasp just what a flop the project was, it’s important to consider what success would have looked like.  If the project had produced what Gates was hoping, it would have found that classroom observations were strong, independent predictors of other measures of effective teaching, like student test score gains.  Even better, they were hoping that the combination of classroom observations, student surveys, and previous test score gains would be a much better predictor of future test score gains (or of future classroom observations) than any one of those measures alone.  Unfortunately, MET failed to find anything like this.

If MET had found classroom observations to be strong predictors of other indicators of effective teaching and if the combination of measures were a significantly better predictor than any one measure alone, then Gates could have offered evidence for the merits of a particular mixing formula or range of mixing formulas for evaluating teachers.  That evidence could have been used to good effect to shape teacher evaluation systems in Chicago, LA, and everywhere else.

They also could have genuinely reassured teachers anxious about the use of test score gains in teacher evaluations.  MET could have allayed those concerns by telling teachers that test score gains produce information that is generally similar to what is learned from well-conducted classroom observations, so there is no reason to oppose one and support the other.  What’s more, significantly improved predictive power from a mixture of classroom observations with test score gains could have made the case for why we need both.

MET was also supposed to have helped us adjudicate among several commonly used rubrics for classroom observations so that we would have solid evidence for preferring one approach over another.  Because MET found that classroom observations in general are barely related to other indicators of teacher effectiveness, the study told us almost nothing about the criteria we should use in classroom observations.

In addition, the classroom observation study was supposed to help us identify the essential components of effective teaching.  That knowledge could have informed improved teacher training and professional development.  But because MET was a flop (because classroom observations barely correlate with other indicators of teacher effectiveness and fail to improve the predictive power of a combined measure), we haven’t learned much of anything about the practices that are associated with effective teaching.  If we can’t connect classroom observations with effective teaching in general, we certainly can’t say much about the particular aspects of teaching that were observed that most contributed to effective teaching.

Just so you know that I’m not falsely attributing to MET these goals that failed to be realized, look at this interview from 2011 of Bill Gates by Jason Riley in the Wall Street Journal.  You’ll clearly see that Bill Gates was hoping that MET would do what I described above.  It failed to do so.  Here is what the interview revealed about the goals of MET:

Of late, the foundation has been working on a personnel system that can reliably measure teacher effectiveness. Teachers have long been shown to influence students’ education more than any other school factor, including class size and per-pupil spending. So the objective is to determine scientifically what a good instructor does.

“We all know that there are these exemplars who can take the toughest students, and they’ll teach them two-and-a-half years of math in a single year,” he says. “Well, I’m enough of a scientist to want to say, ‘What is it about a great teacher? Is it their ability to calm down the classroom or to make the subject interesting? Do they give good problems and understand confusion? Are they good with kids who are behind? Are they good with kids who are ahead?’

“I watched the movies. I saw ‘To Sir, With Love,'” he chuckles, recounting the 1967 classic in which Sidney Poitier plays an idealistic teacher who wins over students at a roughhouse London school. “But they didn’t really explain what he was doing right. I can’t create a personnel system where I say, ‘Go watch this movie and be like him.'”

Instead, the Gates Foundation’s five-year, $335-million project examines whether aspects of effective teaching—classroom management, clear objectives, diagnosing and correcting common student errors—can be systematically measured. The effort involves collecting and studying videos of more than 13,000 lessons taught by 3,000 elementary school teachers in seven urban school districts.

“We’re taking these tapes and we’re looking at how quickly a class gets focused on the subject, how engaged the kids are, who’s wiggling their feet, who’s looking away,” says Mr. Gates. The researchers are also asking students what works in the classroom and trying to determine the usefulness of their feedback.

Mr. Gates hopes that the project earns buy-in from teachers, which he describes as key to long-term reform. “Our dream is that in the sample districts, a high percentage of the teachers determine that this made them better at their jobs.” He’s aware, though, that he’ll have a tough sell with teachers unions, which give lip service to more-stringent teacher evaluations but prefer existing pay and promotion schemes based on seniority—even though they often end up matching the least experienced teachers with the most challenging students.

The final MET reports produced virtually nothing that addressed these stated goals.  But in Orwellian fashion, the Gates folks have declared the project to be a great success.  I never expected MET to work because I suspect that effective teaching is too heterogeneous to be captured well by a single formula.  There is no recipe for effective teaching because kids and their needs are too varied, teachers and their abilities are too varied, and the proper matching of student needs and teacher abilities can be accomplished in many different ways.  But this is just my suspicion.  I can’t blame the Gates Foundation for trying to discover the secret sauce of effective teaching, but I can blame them for refusing to admit that they failed to find it.  Even worse, I blame them for distorting, exaggerating, and spinning what they did find.

(edited for typos)

Understanding the Gates Foundation’s Measuring Effective Teachers Project

January 9, 2013

If I were running a school I’d probably want to evaluate teachers using a mixture of student test score gains, classroom observations, and feedback from parents, students, and other staff.  But I recognize that different schools have different missions and styles that can best be assessed using different methods.  I wouldn’t want to impose on all schools in a state or the nation a single, mechanistic system for evaluating teachers since that is likely to be a one size fits none solution.  There is no single best way to evaluate teachers, just like there is no single best way to educate students.

But the folks at the Gates Foundation, afflicted with PLDD, don’t see things this way.  They’ve been working with politicians in Illinois, Los Angeles, and elsewhere to centrally impose teacher evaluation systems, but they’ve encountered stiff resistance.  In particular, they’ve noticed that teachers and others have expressed strong reservations about any evaluation system that relies too heavily on student test scores.

So the folks at Gates have been trying to scientifically validate a teacher evaluation system that involves a mix of test score gains, classroom observations, and student surveys so that they can overcome resistance to centrally imposed, mechanistic evaluation systems.  If they can reduce reliance on test scores in that system while still carrying the endorsement of “science,” the Gates folks imagine that politicians, educators, and others will all embrace the Gates central planning fantasy.

Let’s leave aside for the moment the political reality, demonstrated recently in Chicago and Los Angeles, that teachers are likely to fiercely resist any centrally imposed, mechanistic evaluation system regardless of the extent to which it relies on test scores.  The Gates folks want to put on their lab coats and throw the authority of science behind a particular approach to teacher evaluation.  If you oppose it you might as well deny global warming.  Science has spoken.

So it is no accident that the release of the third and final round of reports from the Gates Foundation’s Measuring Effective Teachers project was greeted with the following headline in the Washington Post: “Gates Foundation study: We’ve figured out what makes a good teacher,”  or this similarly humble claim in the Denver Post: “Denver schools, Gates foundation identify what makes effective teacher.”  This is the reaction that the Gates Foundation was going for — we’ve used science to discover the correct formula for evaluating teachers.  And by implication, we now know how to train and improve teachers by using the scientifically validated methods of teaching.

The only problem is that things didn’t work out as the Gates folks had planned.  Classroom observations make virtually no independent contribution to the predictive power of a teacher evaluation system.  You have to dig to find this, but it’s right there in Table 1 on page 10 of one of the technical reports released yesterday.  In a regression to predict student test score gains using out of sample test score gains for the same teacher, student survey results, and classroom observations, there is virtually no relationship between test score gains and either classroom observations or student survey results.  In only 3 of the 8 models presented is there any statistically significant relationship between either classroom observations or student surveys and test score gains (I’m excluding the 2 instances where they report p < .1 as statistically significant).  And in all 8 models the point estimates suggest that a standard deviation improvement in classroom observation or student survey results is associated with less than a .1 standard deviation increase in test score gains.

Not surprisingly, a composite teacher evaluation measure that mixes classroom observations and student survey results with test score gains is generally no better and sometimes much worse at predicting out of sample test score gains than test score gains alone.  The Gates folks trumpet the finding that the combined measures are more “reliable” but that only means that they are less variable, not any more predictive.

But “the best mix” according to the “policy and practitioner brief” is “a composite with weights between 33 percent and 50 percent assigned to state test scores.”  How do they know this is the “best mix?”  It generally isn’t any better at predicting test score gains.  And to collect the classroom observations involves an enormous expense and hassle.  To get the measure as “reliable” as they did without sacrificing too much predictive power, the Gates team had to observe each teacher at least four different times by at least two different coders, including one coder outside of the school.  To observe 3.2 million public school teachers for four hours by staff compensated at $40 per hour would cost more than $500 million each year.  The Gates people also had to train the observers at least 17 hours and even after that had to throw out almost a quarter of those observers as unreliable.  To do all of this might cost about $1 billion each year.
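The observation-cost figure above is simple arithmetic and can be checked directly.  The sketch below uses only the three numbers stated in this post (3.2 million teachers, four hours of observation each, $40 per observer hour).  The names are illustrative, and it leaves out observer training and discarded observers, for which the post gives no per-unit cost.

```python
# Back-of-the-envelope check of the classroom observation cost quoted above.
# The three inputs come from the post; the names are illustrative only.
TEACHERS = 3_200_000      # public school teachers
HOURS_PER_TEACHER = 4     # hours of observation per teacher per year
HOURLY_RATE = 40          # dollars paid per observer hour


def annual_observation_cost(teachers, hours, rate):
    """Return the yearly cost, in dollars, of observing every teacher."""
    return teachers * hours * rate


cost = annual_observation_cost(TEACHERS, HOURS_PER_TEACHER, HOURLY_RATE)
print(f"${cost / 1e6:,.0f} million per year")  # prints "$512 million per year"
```

That is the observation time alone.  Training and replacing unreliable observers would come on top of it, which is how the total could plausibly approach the rough $1 billion guess.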

And what would we get for this billion?  Well, we might get more consistent teacher evaluation scores, but we’d get basically no improvement in the identification of effective teachers.  And that’s the “best mix?”  Best for what?  It’s best for the political packaging of a centrally imposed, mechanistic teacher evaluation system, which is what this is all really about.  Vicki Phillips, who heads the Gates education efforts, captured in this comment what I think they are really going for with a composite evaluation score:

Combining all three measures into a properly weighted index, however, produced a result “teachers can trust,” said Vicki Phillips, a director in the education program at the Gates Foundation.

It’ll cost a fortune, it doesn’t improve the identification of effective teachers, but we need to do it to overcome resistance from teachers and others.  Not only will this not work, but in spinning the research as they have, the Gates Foundation is clearly distorting the straightforward interpretation of their findings: a mechanistic system of classroom observation provides virtually nothing for its enormous cost and hassle.  Oh, and this is the case when no stakes were attached to the classroom observations.  Once we attach all of this to pay or continued employment, their classroom observation system will only get worse.

I should add that if classroom observations aren’t useful as predictors, they also can’t be used effectively for diagnostic purposes.  An earlier promise of this project was that they would figure out which teacher evaluation rubrics were best and which sub-components of those rubrics were most predictive of effective teaching.  But that clearly hasn’t panned out.  In the new reports I can’t find anything about the diagnostic potential of classroom observations, which is not surprising since those observations are not predictive.

So, rather than having “figured out what makes a good teacher” the Gates Foundation has learned very little in this project about effective teaching practices.  The project was an expensive flop.  Let’s not compound the error by adopting this expensive flop as the basis for centrally imposed, mechanistic teacher evaluation systems nationwide.

(Edited for typos and to add links.  To see a follow-up post, click here.)

Gates Gets Groovy, Invests in Mood Rings

June 19, 2012

Building on their earlier $1.4 million investment in bracelets to measure skin conductivity (sweating) as a proxy for student engagement, the Gates Foundation has decided to embark on a multi-million dollar investment in mood rings.

As you can see from their research results pictured above, the mood ring is capable of identifying a variety of student emotional states that could affect the learning environment.  Teachers need to be particularly wary of the “hungry for waffles” mood because it is sometimes followed by the “flatulence” or “full bladder” mood.

Besides, mood rings are pretty groovy.  And they can’t be any dumber than these Q Sensor bracelets.

Gates Goes Wild

June 19, 2012

Gates researchers using science to enhance student learning

Even a blind squirrel occasionally finds an acorn.  Well, Diane Ravitch, Susan Ohanian, Leonie Haimson, and their tinfoil hat crew have stumbled upon some of the craziest stuff I’ve ever heard in ed reform.  It appears the Gates Foundation has spent more than $1 million to develop Galvanic Skin Response bracelets to gauge student response to instruction as part of their Measuring Effective Teachers project.  The Galvanic Skin Response measures the electrical conductance of the skin, which varies largely due to the moisture from people’s sweat.

Stephanie Simon, a Reuters reporter, summarizes the Gates effort:

The foundation has given $1.4 million in grants to several university researchers to begin testing the devices in middle-school classrooms this fall.

The biometric bracelets, produced by a Massachusetts startup company, Affectiva Inc, send a small current across the skin and then measure subtle changes in electrical charges as the sympathetic nervous system responds to stimuli. The wireless devices have been used in pilot tests to gauge consumers’ emotional response to advertising.

Gates officials hope the devices, known as Q Sensors, can become a common classroom tool, enabling teachers to see, in real time, which kids are tuned in and which are zoned out.

Um, OK.  We’ve already written about how unreliable the Gates Foundation is in describing their own research, here and here.  And we’ve already written about how the entire project of using science to discover the best way to teach is a fool’s enterprise.

And now the Gates Foundation is extending that foolish enterprise to include measuring Galvanic Skin Response as a proxy for student engagement.  This simply will not work.  The extent to which students sweat is not a proxy for engagement or for learning.  It is probably a better proxy for whether they are seated near the heater or next to a really pretty girl (or handsome boy).

Galvanic Skin Response has already been widely used as part of the “scientific” effort to detect lying.  And as any person who actually cares about science knows — lie detectors do not work.  Sweating is no more a sign of lying than it is of student engagement.

I’m worried that the Gates Foundation is turning into a Big Bucket of Crazy.  Anyone who works for Gates should be worried about this.  Anyone who is funded by Gates should be worried about this.  If people don’t stand up and tell Gates that they are off the rails, the reputation of everyone associated with Gates will be tainted.

Gates, the Bizarro Foundation

January 31, 2012

Comic book geeks are familiar with Bizarro World, a place where everything is the opposite of what it is in the normal world.  In Bizarro World, people would abandon a policy strongly supported by rigorous evidence while embracing an alternative policy for which the evidence showed little promise.

I was thinking about Bizarro World and then it struck me — Perhaps the Gates Foundation has somehow fallen into the Bizarro World.  It’s just about the only thing that makes sense of their Bizarro choices with respect to education reform strategies.

The dominant education reform strategy of the Gates Foundation before 2006 was to break large high schools into smaller ones, often using school choice and charter schools.  As a Business Week profile put it:

The foundation embraced what many social scientists had concluded was the prime solution: Instead of losing kids in large schools like Manual, the new thinking was to divide them into smaller programs with 200 to 600 students each. Doing so, numerous studies showed, would help prevent even hard-to-reach students from falling through the cracks. The foundation didn’t set out to design schools or run them. Its goal was to back some creative experiments and replicate them nationally.

But the Gates Foundation wasn’t patient enough to let the experiments produce results.  Instead, they hired SRI and AIR to do a very weakly-designed non-experimental evaluation that produced disappointing results.  Gates had also commissioned a rigorous random-assignment evaluation by MDRC, but it would take a few more years to see if students graduated and went on to college at higher rates if they were assigned by lottery to a smaller school.

Gates couldn’t wait.  They were convinced that small schools were a flop, so they began to ditch the small school strategy and look for a new Big Idea.  Tom Vander Ark, the education chief who had championed small schools, was out the door and replaced with Vicki Phillips, whose claim to fame, such as it was, came from her time as Portland’s superintendent, where she consolidated schools (rather than breaking them into smaller ones) and centralized control over curriculum and instruction.  As one local observer put it:

In her time in the famously progressive, consensus-driven city, she closed six schools, merged nearly two dozen others through K-8 conversions, pushed to standardize the district’s curriculum, and championed new and controversial measures for testing the district’s 46,000 children-all mostly without stopping for long enough to adequately address the concerns her changes generated in the neighborhoods and schools where they played out.  During her three years in Portland, Phillips’ name became synonymous with top-down management, corporate-style reforms, and a my-way-or-the-highway attitude.

Under Phillips and her deputy education director, Harvard professor Tom Kane, the Gates Foundation has pursued a very different strategy: attempting to identify the best standards, curriculum, and pedagogy and then imposing those best practices through a national system of standards and testing.

And here is where we see that Gates must be the Bizarro Foundation.  The previous strategy of backing small schools has now been vindicated by the rigorous random-assignment study Gates couldn’t wait for.  According to the New York Times:

The latest findings show that 67.9 percent of the students who entered small high schools in 2005 and 2006 graduated four years later, compared with 59.3 percent of the students who were not admitted and instead went to larger schools. The higher graduation rate at small schools held across the board for all students, regardless of race, family income or scores on the state’s eighth-grade math and reading tests, according to the data.

This increase was almost entirely accounted for by a rise in Regents diplomas, which are considered more rigorous than a local diploma; 41.5 percent of the students at small schools received one, compared with 34.9 percent of students at other schools. There was little difference between the two groups in the percentage of students who earned a local diploma or the still more rigorous Advanced Regents diploma.

Small-school students also showed more evidence of college readiness, with 37.3 percent of the students earning a score of 75 or higher on the English Regents, compared with 29.7 percent of students at other schools. There was no significant difference, however, in scores on the math Regents.
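To put the quoted rates side by side, here is a small sketch that computes the gap for each outcome, both in percentage points and relative to the large-school rate.  The percentages are the ones in the New York Times passage above.  The function names are mine, and nothing here speaks to statistical significance.

```python
# Rates (in percent) quoted above, as (small schools, other schools).
RATES = {
    "four-year graduation": (67.9, 59.3),
    "Regents diploma": (41.5, 34.9),
    "English Regents score of 75+": (37.3, 29.7),
}


def point_gap(small, other):
    """Difference in percentage points, rounded to one decimal place."""
    return round(small - other, 1)


def relative_gain(small, other):
    """Gap expressed as a percent of the other-school rate."""
    return round(100 * (small - other) / other, 1)


for label, (small, other) in RATES.items():
    print(f"{label}: +{point_gap(small, other)} points, "
          f"+{relative_gain(small, other)}% relative")
```

The graduation gap works out to 8.6 percentage points, of which 6.6 points show up as additional Regents diplomas, consistent with the article's remark that the increase was almost entirely accounted for by Regents diplomas.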

Meanwhile, as part of their newly embraced top-down strategy, the Gates effort to identify the secret formula for effective teaching has failed to bear fruit.  The Gates-operated Measuring Effective Teachers Project failed to identify any rubric for observing teachers or any components of those rubrics that were strongly predictive of gains in student learning.  And the Gates-backed “research” supporting the federally-orchestrated Common Core push for national standards and testing has been strikingly lacking in scientific rigor and candor.

In short, the Gates Foundation has ditched what rigorous evidence shows worked and is pushing a new strategy completely unsupported by rigorous evidence.  They must be in Bizarro World.  Somebody please get me some blue kryptonite.

Anticipating Responses from Gates

January 9, 2012

Over the weekend I posted about how I thought the Gates Foundation was spinning the results of their Measuring Effective Teachers Project to suggest that the combination of student achievement gains, student surveys, and classroom observations was the best way to have a predictive measure of teacher effectiveness.  Let me anticipate some of the responses they may have:

1) They might say that they clearly admit the limitations of classroom observations and therefore are not guilty of spinning the results to inflate their importance.  They could point to p. 15 of the research paper in which they write: “When value-added data are available, classroom observations add little to the ability to predict value-added gains with other groups of students. Moreover, classroom observations are less reliable than student feedback, unless many different observations are added together.”

Response: I said in my post over the weekend that the Gates folks were careful so that nothing in the reports is technically incorrect.  The distortion of their findings comes from the emphasis and manner of presentation.  For example, the summary of findings in the research paper on p. 9 states: “Combining observation scores with evidence of student achievement gains and student feedback improved predictive power and reliability.”  Or the “key findings” in the practitioner brief on p. 5 say: “Observations alone, even when scores from multiple observations were averaged together, were not as reliable or predictive of a teacher’s student achievement gains with another group of students as a measure that combined observations with student feedback and achievement gains on state tests.”  Notice that these summaries of the results fail to mention the most straightforward and obvious finding: classroom observations are really expensive and cumbersome and yet do almost nothing to improve the predictiveness of student achievement-based measures of teacher quality.

And the proof that the results are being spun is that the media coverage uniformly repeats the incorrect claim that multiple measures are an important improvement on test scores alone.  Either all of the reporters are lousy and don’t understand the reports or the reporters are accurately repeating what they are being told and what they overwhelmingly see in the reports.  My money is on the latter explanation.

And further proof that the reporters are being spun is that Vicki Phillips, the Gates education chief, is quoted in the LA Times coverage mis-characterizing the findings: “Using these methods to evaluate teachers is ‘more predictive and powerful in combination than anything we have used as a proxy in the past,’ said Vicki Phillips, who directs the Gates project.”  This is just wrong.  As I pointed out in my previous post, the combined measure is no more predictive than student achievement by itself.

Lastly, the standard for fair and accurate reporting of results is not whether one could find any way to show that technically the description of findings is not false.  We should expect the most straightforward and obvious description of findings to be emphasized.  With the Gates folks I feel like I am repeatedly parsing what the meaning of the word “is” is.  That’s political spin, not research.

2) They might say that classroom observations are an important addition because at least they provide diagnostic information about how teachers can improve, while test scores cannot.

Response:  This may be true, but it is not a claim supported by the Gates study.  They found that all of the different classroom observation methods they tried had very weak predictive power.  You can’t provide a lot of feedback about how to improve student achievement based on instruments that are barely correlated with gains in student achievement.  In addition, they were unable to find sub-components of the classroom observation methods that were more predictive, so they can’t tell teachers that they really need to do certain things, since those things are much more strongly related to student learning gains.  Lastly, it is simply untrue that test scores cannot be diagnostic.  There are sub-components of the tests that measure learning in different aspects of the subject.  Teachers could be told to emphasize more those areas on which their students have lagged.

3) They may say that classroom observations and student surveys improve the reliability of a teacher quality measure when combined with test scores.

Response: An increase in reliability is cold comfort for a lack of predictive power.  Reliability is just an indicator of how consistent a measure is.  There are plenty of measures that are very consistent but not helpful in predicting teacher quality.  For example, if we asked students to rate how attractive their teacher was, we would probably get a very “reliable” (consistent) measure from year to year and section to section.  But that consistency would not make up for the fact that attractiveness is unlikely to help improve the prediction of effective teaching.  So, the student survey has a high amount of consistency, but who knows what that is really measuring since it is only weakly related to student learning gains.  It is consistent, but consistently wrong.  Our focus should be on the predictive power of teacher evaluations, and classroom observations and student surveys don’t really do anything to help with that (at least, not according to the Gates study).
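The gap between reliability and predictive power can be illustrated with a toy simulation.  This is only a sketch on synthetic data, not a reanalysis of the Gates study.  It assumes a hypothetical stable trait (think of the attractiveness rating) that is unrelated to a teacher's true effect on learning, and every parameter below is invented for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_teachers=2000, seed=0):
    """Return (reliability, predictive) correlations for a synthetic rating."""
    rng = random.Random(seed)
    # Hypothetical stable trait, assumed independent of teaching quality.
    trait = [rng.gauss(0, 1) for _ in range(n_teachers)]
    # True teacher effect on learning, drawn independently of the trait.
    effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    # The same rating collected in two years, each with a little noise.
    rating_y1 = [t + rng.gauss(0, 0.3) for t in trait]
    rating_y2 = [t + rng.gauss(0, 0.3) for t in trait]
    # Noisy student test score gains driven by the true effect.
    gains = [e + rng.gauss(0, 1) for e in effect]
    reliability = pearson(rating_y1, rating_y2)  # year-to-year consistency
    predictive = pearson(rating_y1, gains)       # relation to learning gains
    return reliability, predictive


reliability, predictive = simulate()
print(f"year-to-year reliability: {reliability:.2f}")
print(f"correlation with gains:   {predictive:.2f}")
```

Under these assumptions the rating is highly consistent from year to year yet essentially uncorrelated with test score gains: a measure can be very "reliable" while telling us almost nothing about effective teaching.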

4) They may say that classroom observations and student surveys improve on the prediction of student effort and classroom environment.

Response: As I mentioned in the post over the weekend, they don’t really have validated measures of student effort and classroom environment.  The Gates folks took a lot of flak last year for focusing on test-score gains, so they came up with some non-test score outcome measures simply by taking some of the items from the student survey where students are asked about their effort or classroom environment.  We have no idea whether they have really measured the amount of effort students exert or the quality of the classroom environment; they are just using some survey answers on those items and claiming that they have measured those “outcomes.”  The only validated outcome measure we have in the Gates study is the test score gains, so we have to focus on that.


The good news is that my fears about the Gates study being used to dictate what teachers do have not been realized, at least not yet.  But it wasn’t for lack of trying.  If the classroom observations had worked a little better in predicting student learning gains, I’m sure we would have heard about how teachers should run their classrooms to produce greater gains.  But the classroom observations were so much of a dud that Gates education chief, Vicki Phillips, didn’t even attempt to claim that they found that drill and kill is bad or that teachers should avoid teaching to the test.

But the inability to use the classroom observations to tell teachers the “right” way of teaching is another way of saying that the classroom observations are not able to be used for diagnostic purposes.  The most straightforward reading of the Gates results is that classroom observations appear to be an expensive and ineffective dud.  But it’s hard for an organization that spends $45 million on a project to scientifically validate classroom observations to admit that it failed.   It’s hard enough for a third-party evaluator to say that, let alone an in-house study about a key aspect of the Gates policy agenda.