Wolf and Witte Slam Ravitch on Milwaukee School Choice

January 18, 2013

Dwight Howard winning the 2008 Slam Dunk Contest.

As I’ve said before, I’m trying to avoid writing about Diane Ravitch because I think it’s now clear to all sensible people that she has gone completely nuts, lacks credibility, and was probably never much of a scholar.  But I just can’t resist posting a link to the editorial my colleagues Pat Wolf and John Witte wrote today in the Milwaukee Journal Sentinel.  Wolf and Witte are responding to an earlier op-ed by Ravitch in which she declares:

Milwaukee needs one public school system that receives public dollars, public support, community engagement and parental involvement.

Vouchers and charters had their chance. They failed.

Wolf and Witte actually review the evidence on Milwaukee’s choice programs, including their own research.  They conclude:

Our research signals what likely would happen if Ravitch got her wish and the 25,000 students in the Milwaukee voucher program and nearly 8,000 children in independent charter schools were thrown out of their chosen schools. Student achievement would drop, as every student would be forced into MPS – the only game in town. Significantly fewer Milwaukee students would graduate high school and benefit from college. Parents would be denied educational choices for their children.

That’s not a future we would wish for the good people of Milwaukee.

There’s no point in trying to persuade Ravitch or her Army of Angry Teachers, since they abandoned rationality a long time ago.  But Wolf and Witte have done an excellent job of equipping sensible people with evidence that could help inform their views about school choice in Milwaukee.  Angry blather and bold (but false) declarations cannot compete with actual facts.

[Edited to correct typo in title.]


The Implications of a Blue Texas

January 17, 2013

(Guest Post by Matthew Ladner)

So I have been thinking about the talk of a “Blue Texas.” Texas has experienced a profound shift in partisan dominance within our lifetimes, and demographic changes in the state portend that it may happen again. Texas moved out of being part of the “Solid South” starting in the 1970s with the slow but steady rise of the Texas Republican party. Republicans had captured all of the statewide elected offices by the 1990s. Finally, the Republicans overcame Democratic gerrymandering to capture a majority in the Texas House and Senate in 2003.

A profound demographic shift has placed an expiration date upon the control of the Texas legislature by conservative Anglos. Conservatives may or may not remain ascendant in Texas but the days of the political dominance of conservative Anglos are certainly numbered.

One can see this trend coming in the ethnic distribution of the Texas school population. In 2011-12, Hispanics comprised 50.8 percent of children enrolled in the Texas public school system, Anglos only 30.5 percent, and African-Americans only 12.8 percent. You can also get a sense of the scale and growth of Texas by looking at public education statistics. With nearly five million students, Texas educates nearly as many public school students as the twenty smallest states combined. Texas may soon have twice as many public school students as Florida, despite the fact that Florida has the fourth-largest public school population.  Texas has been adding a public school population roughly equal in size to the entire public school system of Wyoming every 14 months or so. Texas was the only state to gain four new congressional seats after the 2010 Census; a handful of other states gained two, and no other state gained three or four.

In 2012, Texas Hispanics comprised 25 percent of the electorate and favored Barack Obama over Mitt Romney by a margin of 62 percent to 37 percent. That’s a more balanced result than the national numbers, but hardly reassuring if you are a Texas Republican. Each passing year will see older Republicans passing on and more young Hispanic voters entering the electorate. Some forecasters predict a “Blue Texas” by 2020, although it could happen earlier, later, or never, depending upon a variety of factors.

Let’s start with the electoral college map. Republicans haven’t been very good at getting to 270 lately, even with the now 38 Texas electoral votes in the bag. Without them, states like Florida and Ohio could become mere style points for the Democratic nominee rather than crucial swing states. One could imagine other states trending Republican to counteract a Blue Texas, but that seems far-fetched indeed.

For someone of modestly libertarian politics like myself, the most alarming scenario would be a Blue Texas that becomes, in effect, a second California: a gigantic state in which organized public sector groups play an incredibly strong role in state policymaking. I would expect that might blunt this momentum rather decisively.

Or perhaps not; predictions are hard, especially about the future. Some of you will of course be excited by the idea of a Blue Texas, others horrified by the prospect. Regardless, the implications of a Blue Texas stretch far beyond presidential politics. We can discuss some of those in future posts.

For now let’s keep an eye on this to see what happens next…


Starring Matt Ladner as the Difference Principle!

January 16, 2013

Hippies on stage

(Guest post by Greg Forster)

Are you ready for this? A Theory of Justice: The Musical!

No, really:

In order to draw inspiration for his magnum opus, John Rawls travels back through time to converse (in song) with a selection of political philosophers, including Plato, Locke, Rousseau and Mill. But the journey is not as smooth as he hoped: for as he pursues his love interest, the beautiful student Fairness, through history, he must escape the evil designs of his libertarian arch-nemesis, Robert Nozick, and his objectivist lover, Ayn Rand. Will he achieve his goal of defining Justice as Fairness?

Wait, I thought they already made that show. It was called Hair.

Here’s a publicity photo from the production – Matt Ladner in costume for his co-starring role as “The Difference Principle”:


HT David Koyzis


Shanker Institute Scholar Bounded in a Nutshell but Counts Himself a King of Infinite Space

January 15, 2013

(Guest Post by Matthew Ladner)

Matthew DiCarlo of the Shanker Institute has taken to reviewing the statistical evidence on the Florida K-12 reforms. DiCarlo reaches the conclusion that we ultimately can’t draw much in the way of conclusions regarding aggregate movement of scores.  He’s rather emphatic on the point:

In the meantime, regardless of one’s opinion on whether the “Florida formula” is a success and/or should be exported to other states, the assertion that the reforms are responsible for the state’s increases in NAEP scores and FCAT proficiency rates during the late 1990s and 2000s not only violates basic principles of policy analysis, but it is also, at best, implausible. The reforms’ estimated effects, if any, tend to be quite small, and most of them are, by design, targeted at subgroups (e.g., the “lowest-performing” students and schools). Thus, even large impacts are no guarantee to show up at the aggregate statewide level (see the papers and reviews in the first footnote for more discussion).

DiCarlo obviously has formal training in the statistical dark arts, and the vast majority of academics involved in policy analysis would probably agree with his point of view. What he lacks, however, is an appreciation of the limitations of social science.

Social scientists are quite rightly obsessed with issues of causality. Statistical training quickly reveals to the student that people are constantly making ad-hoc theories about some X resulting in some Y without much proof. Life abounds with half-baked models of reality and incomplete understandings of phenomena, which have a consistent and nasty habit of proving quite complex.

Social scientists have developed powerful statistical methods to attempt to establish causality; techniques like random assignment and regression discontinuity can illuminate these questions. These types of studies can bring great value, but it is important to understand their limitations.

DiCarlo for instance reviews the literature on the impact of school choice in Florida. Random assignment school choice studies have consistently found modest but statistically significant test score gains for participating students. Some react to these studies with a bored “meh.” DiCarlo helps himself along in reaching this conclusion by citing some non-random assignment studies. More problematically, he fails to understand the limitations of even the best studies.

For example, even the very best random assignment school choice studies fall apart after a few short years. Students don’t live in social science laboratories but rather in the real world. Random lotteries can divide students into nearly identical groups with the main difference being that one group applied for but did not get to attend a charter or private school. They cannot however stop students in the control group from moving around.

Despite the best efforts of researchers, attrition immediately begins to degrade the control groups in random assignment studies. Usually after three years, they are spent. Those seeking a definitive answer on the long-term impact of school choice on student test scores are in for disappointment. Social science has very real limits and, in this case, is only suggestive. Choice students tend to make small but cumulative gains year by year that tend to become statistically significant around year three, which is right around when the random assignment design falls apart. What’s the long-term impact? I’d like to know too, but it is beyond the power of social science to tell us, leaving us to look for evidence from persistence rates.
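A back-of-the-envelope power calculation shows how this timing can work. The numbers below (per-year effect size, starting sample, attrition rate) are purely illustrative assumptions, not figures from any actual study; the point is just that a small cumulative gain can first cross the significance threshold around year three even as attrition shrinks the sample:

```python
import math

# Illustrative assumptions (not taken from any of the studies discussed):
effect_per_year = 0.05   # cumulative gain in standard deviations per year of attendance
n_start = 1000           # students per arm at the time of the admission lottery
attrition = 0.20         # fraction of each arm lost to follow-up every year

z_by_year = {}
for year in (1, 2, 3):
    n = n_start * (1 - attrition) ** year   # students per arm still in the study
    se = math.sqrt(2.0 / n)                 # std. error of a difference in means, unit-variance scores
    z_by_year[year] = effect_per_year * year / se

for year, z in z_by_year.items():
    print(f"year {year}: z = {z:.2f}" + (" (significant)" if z > 1.96 else ""))
```

Under these assumptions the effect only becomes detectable in year three, by which point a fifth of the remaining sample has been vanishing every year. And the deeper problem, as noted above, is that the students who leave may differ from those who stay, which no power calculation can repair.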

So let’s get back to DiCarlo, who wrote “The reforms’ estimated effects, if any, tend to be quite small, and most of them are, by design, targeted at subgroups (e.g., the “lowest-performing” students and schools). Thus, even large impacts are no guarantee to show up at the aggregate statewide level.”  This is true but fails to recognize the poverty of the social science approach itself.

DiCarlo mentions that “even large impacts are no guarantee to show up at the aggregate statewide level.” This is a reference to the “ecological fallacy,” which teaches us to employ extreme caution when travelling between individual-level and aggregate-level data. Read the above link if you want to know all the brutally geeky reasons why this is the case; take my word for it if you don’t.

DiCarlo is correct that connecting the individual level data (e.g. the studies he cites) to aggregate level gains is a dicey business. He however fails to appreciate the limitations of the studies he cites and the fact that the ecological fallacy problem cuts both ways. In other words, while generally positive, we simply don’t know the relationship between individual policies and aggregate gains.

We know, for instance, that we have a positive study on alternative certification and student learning gains. We do not, and essentially cannot, know how many NAEP points, if any, resulted from this policy. The proper reaction for a practical person interested in larger student learning gains can be summarized as “who cares?” The evidence we have indicates that students who had alternatively certified teachers made larger learning gains. Given the lack of any positive evidence associated with teacher certification, that’s going to be enough for most fair-minded people.

FCAT 1

The individual impact of particular policies on gains in Florida is not clear. What is crystal clear, however, is that there were aggregate-level gains in Florida. You don’t require a random assignment study or a regression equation when considering, for instance, the percentage of FCAT Level 1 reading scores (aka illiterate) in the chart above. When you see the percentage of African American students scoring at the lowest of five achievement levels drop from 41% to 26% on a test with consistent standards, it is little wonder that policymakers around the country have emulated the policy, despite DiCarlo’s skepticism.

I could go on and bomb you with charts showing improving graduation rates, NAEP scores, Advanced Placement passing rates, etc., but I’ll spare you. The point is that there are very clear signs of aggregate-level improvement in Florida, and also a large number of studies at the individual level showing positive results from individual policies.

The individual-level results do not “prove” that the reforms caused the aggregate-level gains. DiCarlo’s problem is that they also certainly do not prove that they didn’t. It has therefore been necessary from the beginning to examine other possible explanations for the aggregate gains. The problem here for skeptics is that the evidence weighs very much against them: Florida’s K-12 population became both demographically and economically more challenging after the advent of reform, spending increases were the lowest in the country since the early 1990s (see Figure 4), and other policies favored by skeptics came into play long after the improvement in scores began.

The problem for Florida reform skeptics, in short, is that there simply isn’t any other plausible explanation for Florida’s gains outside of the reforms. They flailed around with an unsophisticated story about 3rd grade retention and NAEP, unable and unwilling to explain, among other problems, the 3rd grade improvement shown above. One of NEPC’s crew once theorized at a public forum that Harry Potter books may have caused Florida’s academic gains. DiCarlo has moved on to trying to split hairs with a literature review.

With large aggregate gains and plenty of positive research, the reasonable course is not to avoid doing any of the Florida reforms, but rather to do all of them. In the immortal words of Freud, sometimes a cigar really is just a cigar.


Head Start Revealed

January 14, 2013

Despite the obvious effort to delay and conceal the disappointing results from the official and high quality evaluation of Head Start, the Wall Street Journal shines the light on the issue in today’s editorial.  DC’s manipulating scumbags might want to take note that efforts to hide negative research might just draw more attention.  It’s comforting to see that the world may sometimes look more like Dostoevsky’s Crime and Punishment than Woody Allen’s Crimes and Misdemeanors.

The Journal reveals that Head Start supporters have not only ignored the latest study, but they are trying to sneak an extra $100 million for Head Start into the relief package for victims of Hurricane Sandy.  They also note that the most recent disappointing Head Start result is just the latest in a string of studies failing to find benefits from the program despite a cumulative expenditure of more than $180 billion.

And then the Journal finishes with this:

The Department of Health and Human Services released the results of the most recent Head Start evaluation on the Friday before Christmas. Once again, the research showed that cognitive gains didn’t last. By third grade, you can’t tell Head Start alumni from their non-Head Start peers.

President Obama has said that education policy should be driven not by ideology but by “what works,” though we have to wonder given his Administration’s history of slow-walking the release of information that doesn’t align with its agenda.

In 2009, the Administration sat on a positive performance review of the Washington, D.C., school voucher program, which it opposes. The Congressionally mandated Head Start evaluation put out last month was more than a year late, is dated October 2012 and was released only after Republican Senator Tom Coburn and Congressman John Kline sent a letter to HHS Secretary Kathleen Sebelius requesting its release along with an explanation for the delay. Now we know what was taking so long.

Like so many programs directed at the poor, Head Start is well-intentioned, and that’s enough for self-congratulatory progressives to keep throwing money at it despite the outcomes. But misleading low-income parents about the efficacy of a program is cruel and wastes taxpayer dollars at a time when the country is running trillion-dollar deficits.

A government that cared about results would change or end Head Start, but instead Congress will use the political cover of disaster relief to throw more good money after proven bad policy.

[UPDATE: And here is a good follow-up op-ed on the study by Lindsey Burke on the Fox News web site.]


What Success Would Have Looked Like

January 10, 2013

Yesterday I described the Gates Foundation’s Measuring Effective Teachers (MET) project as “an expensive flop.”  To grasp just what a flop the project was, it’s important to consider what success would have looked like.  If the project had produced what Gates was hoping, it would have found that classroom observations were strong, independent predictors of other measures of effective teaching, like student test score gains.  Even better, they were hoping that the combination of classroom observations, student surveys, and previous test score gains would be a much better predictor of future test score gains (or of future classroom observations) than any one of those measures alone.  Unfortunately, MET failed to find anything like this.

If MET had found classroom observations to be strong predictors of other indicators of effective teaching and if the combination of measures were a significantly better predictor than any one measure alone, then Gates could have offered evidence for the merits of a particular mixing formula or range of mixing formulas for evaluating teachers.  That evidence could have been used to good effect to shape teacher evaluation systems in Chicago, LA, and everywhere else.

They also could have genuinely reassured teachers anxious about the use of test score gains in teacher evaluations.  MET could have allayed those concerns by telling teachers that test score gains produce information that is generally similar to what is learned from well-conducted classroom observations, so there is no reason to oppose one and support the other.  What’s more, significantly improved predictive power from a mixture of classroom observations with test score gains could have made the case for why we need both.

MET was also supposed to have helped us adjudicate among several commonly used rubrics for classroom observations so that we would have solid evidence for preferring one approach over another.  Because MET found that classroom observations in general are barely related to other indicators of teacher effectiveness, the study told us almost nothing about the criteria we should use in classroom observations.

In addition, the classroom observation study was supposed to help us identify the essential components of effective teaching.  That knowledge could have informed improved teacher training and professional development.  But because MET was a flop (because classroom observations barely correlate with other indicators of teacher effectiveness and fail to improve the predictive power of a combined measure), we haven’t learned much of anything about the practices that are associated with effective teaching.  If we can’t connect classroom observations with effective teaching in general, we certainly can’t say much about the particular aspects of teaching that were observed that most contributed to effective teaching.

Just so you know that I’m not falsely attributing to MET these goals that failed to be realized, look at this interview from 2011 of Bill Gates by Jason Riley in the Wall Street Journal.  You’ll clearly see that Bill Gates was hoping that MET would do what I described above.  It failed to do so.  Here is what the interview revealed about the goals of MET:

Of late, the foundation has been working on a personnel system that can reliably measure teacher effectiveness. Teachers have long been shown to influence students’ education more than any other school factor, including class size and per-pupil spending. So the objective is to determine scientifically what a good instructor does.

“We all know that there are these exemplars who can take the toughest students, and they’ll teach them two-and-a-half years of math in a single year,” he says. “Well, I’m enough of a scientist to want to say, ‘What is it about a great teacher? Is it their ability to calm down the classroom or to make the subject interesting? Do they give good problems and understand confusion? Are they good with kids who are behind? Are they good with kids who are ahead?’

“I watched the movies. I saw ‘To Sir, With Love,'” he chuckles, recounting the 1967 classic in which Sidney Poitier plays an idealistic teacher who wins over students at a roughhouse London school. “But they didn’t really explain what he was doing right. I can’t create a personnel system where I say, ‘Go watch this movie and be like him.'”

Instead, the Gates Foundation’s five-year, $335-million project examines whether aspects of effective teaching—classroom management, clear objectives, diagnosing and correcting common student errors—can be systematically measured. The effort involves collecting and studying videos of more than 13,000 lessons taught by 3,000 elementary school teachers in seven urban school districts.

“We’re taking these tapes and we’re looking at how quickly a class gets focused on the subject, how engaged the kids are, who’s wiggling their feet, who’s looking away,” says Mr. Gates. The researchers are also asking students what works in the classroom and trying to determine the usefulness of their feedback.

Mr. Gates hopes that the project earns buy-in from teachers, which he describes as key to long-term reform. “Our dream is that in the sample districts, a high percentage of the teachers determine that this made them better at their jobs.” He’s aware, though, that he’ll have a tough sell with teachers unions, which give lip service to more-stringent teacher evaluations but prefer existing pay and promotion schemes based on seniority—even though they often end up matching the least experienced teachers with the most challenging students.

The final MET reports produced virtually nothing that addressed these stated goals.  But in Orwellian fashion, the Gates folks have declared the project to be a great success.  I never expected MET to work because I suspect that effective teaching is too heterogeneous to be captured well by a single formula.  There is no recipe for effective teaching because kids and their needs are too varied, teachers and their abilities are too varied, and the proper matching of student needs and teacher abilities can be accomplished in many different ways.  But this is just my suspicion.  I can’t blame the Gates Foundation for trying to discover the secret sauce of effective teaching, but I can blame them for refusing to admit that they failed to find it.  Even worse, I blame them for distorting, exaggerating, and spinning what they did find.

(edited for typos)


Expulsion Rates in DC

January 10, 2013

(Guest Post by Matthew Ladner)

The Washington Post has an important story up about expulsion rates in DC district and charter schools.  I can’t figure out how to embed anything but YouTube videos, so the link is here.

Go watch it.

I’ll be here when you get back.

Go on…

Ok good. One important item to note: if we were to look up the criminal incident reports, we would quickly conclude that the expulsion rate in DCPS is far too low.  If I wanted to be cruel, I’d go dig up the crime data. The video specifies that DCPS expelled three students last year, while the charter schools expelled 200.

It seems self-evident to me that 3 was far too low, and it is difficult to know whether 200 is “too many” for the charter sector without a great deal more context.  A district where you have to make the FBI Most Wanted List before getting expelled is not a proper baseline for comparison.

Discuss amongst yourselves…


Understanding the Gates Foundation’s Measuring Effective Teachers Project

January 9, 2013

If I were running a school I’d probably want to evaluate teachers using a mixture of student test score gains, classroom observations, and feedback from parents, students, and other staff.  But I recognize that different schools have different missions and styles that can best be assessed using different methods.  I wouldn’t want to impose on all schools in a state or the nation a single, mechanistic system for evaluating teachers since that is likely to be a one size fits none solution.  There is no single best way to evaluate teachers, just like there is no single best way to educate students.

But the folks at the Gates Foundation, afflicted with PLDD, don’t see things this way.  They’ve been working with politicians in Illinois, Los Angeles, and elsewhere to centrally impose teacher evaluation systems, but they’ve encountered stiff resistance.  In particular, they’ve noticed that teachers and others have expressed strong reservations about any evaluation system that relies too heavily on student test scores.

So the folks at Gates have been trying to scientifically validate a teacher evaluation system that involves a mix of test score gains, classroom observations, and student surveys so that they can overcome resistance to centrally imposed, mechanistic evaluation systems.  If they can reduce reliance on test scores in that system while still carrying the endorsement of “science,” the Gates folk imagine  that politicians, educators, and others will all embrace the Gates central planning fantasy.

Let’s leave aside for the moment the political reality, demonstrated recently in Chicago and Los Angeles, that teachers are likely to fiercely resist any centrally imposed, mechanistic evaluation system regardless of the extent to which it relies on test scores.  The Gates folks want to put on their lab coats and throw the authority of science behind a particular approach to teacher evaluation.  If you oppose it you might as well deny global warming.  Science has spoken.

So it is no accident that the release of the third and final round of reports from the Gates Foundation’s Measuring Effective Teachers project was greeted with the following headline in the Washington Post: “Gates Foundation study: We’ve figured out what makes a good teacher,”  or this similarly humble claim in the Denver Post: “Denver schools, Gates foundation identify what makes effective teacher.”  This is the reaction that the Gates Foundation was going for — we’ve used science to discover the correct formula for evaluating teachers.  And by implication, we now know how to train and improve teachers by using the scientifically validated methods of teaching.

The only problem is that things didn’t work out as the Gates folks had planned.  Classroom observations make virtually no independent contribution to the predictive power of a teacher evaluation system.  You have to dig to find this, but it’s right there in Table 1 on page 10 of one of the technical reports released yesterday.  In a regression to predict student test score gains using out of sample test score gains for the same teacher, student survey results, and classroom observations, there is virtually no relationship between test score gains and either classroom observations or student survey results.  In only 3 of the 8 models presented is there any statistically significant relationship between either classroom observations or student surveys and test score gains (I’m excluding the 2 instances where they report p < .1 as statistically significant).  And in all 8 models the point estimates suggest that a standard deviation improvement in classroom observation or student survey results is associated with less than a .1 standard deviation increase in test score gains.

Not surprisingly, a composite teacher evaluation measure that mixes classroom observations and student survey results with test score gains is generally no better and sometimes much worse at predicting out of sample test score gains.  The Gates folks trumpet the finding that the combined measures are more “reliable” but that only means that they are less variable, not any more predictive.
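A small simulation makes that distinction concrete. All the numbers below (the weak 0.2 correlation between observation scores and true effectiveness, the noise levels, the 50/50 weighting) are assumptions chosen for illustration, not MET’s estimates; the point is only that averaging in a stable but weakly related measure raises year-to-year consistency while lowering predictive power:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # simulated teachers

# True effectiveness drives test score gains; a stable "observation" trait
# is only weakly correlated with it (0.2, by assumption).
true_effect = rng.standard_normal(n)
obs_trait = 0.2 * true_effect + np.sqrt(1 - 0.2**2) * rng.standard_normal(n)

def standardize(x):
    return (x - x.mean()) / x.std()

# Two years of noisy measurements of each kind: gains are noisy readings of
# true effectiveness; observations are very stable readings of the trait.
gain = [standardize(true_effect + rng.standard_normal(n)) for _ in range(2)]
obs = [standardize(obs_trait + 0.3 * rng.standard_normal(n)) for _ in range(2)]
composite = [standardize(0.5 * g + 0.5 * o) for g, o in zip(gain, obs)]

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

rel_gain = corr(gain[0], gain[1])            # reliability: year-to-year consistency
rel_comp = corr(composite[0], composite[1])
pred_gain = corr(gain[0], gain[1])           # validity: predicting next year's gains
pred_comp = corr(composite[0], gain[1])

print(f"reliability: gains alone {rel_gain:.2f}, composite {rel_comp:.2f}")
print(f"prediction:  gains alone {pred_gain:.2f}, composite {pred_comp:.2f}")
```

Under these assumptions the composite scores move around less from year to year, so they look more trustworthy, while being worse at the one thing the system is supposed to do: identify the teachers whose students will make larger gains.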

But “the best mix,” according to the “policy and practitioner brief,” is “a composite with weights between 33 percent and 50 percent assigned to state test scores.”  How do they know this is the “best mix”?  It generally isn’t any better at predicting test score gains.  And collecting the classroom observations involves enormous expense and hassle.  To get the measure as “reliable” as they did without sacrificing too much predictive power, the Gates team had to have each teacher observed at least four different times by at least two different coders, including one coder outside of the school.  To observe 3.2 million public school teachers for four hours each by staff compensated at $40 per hour would cost more than $500 million each year.  The Gates people also had to train the observers for at least 17 hours, and even after that had to throw out almost a quarter of those observers as unreliable.  To do all of this might cost about $1 billion each year.
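The observation-cost arithmetic is easy to check. The figures below come from the estimate above; the roughly $1 billion total adds training, rater-reliability overhead, and administration, which are not itemized here:

```python
teachers = 3_200_000     # approximate number of U.S. public school teachers
hours_per_teacher = 4    # at least four observations, by at least two coders
hourly_rate = 40         # dollars per hour for trained observers

observation_cost = teachers * hours_per_teacher * hourly_rate
print(f"Annual observation cost alone: ${observation_cost:,}")
# prints: Annual observation cost alone: $512,000,000
```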

And what would we get for this billion?  Well, we might get more consistent teacher evaluation scores, but we’d get basically no improvement in the identification of effective teachers.  And that’s the “best mix?”  Best for what?  It’s best for the political packaging of a centrally imposed, mechanistic teacher evaluation system, which is what this is all really about.  Vicki Phillips, who heads the Gates education efforts, captured in this comment what I think they are really going for with a composite evaluation score:

Combining all three measures into a properly weighted index, however, produced a result “teachers can trust,” said Vicki Phillips, a director in the education program at the Gates Foundation.

It’ll cost a fortune, it doesn’t improve the identification of effective teachers, but we need to do it to overcome resistance from teachers and others.  Not only will this not work, but in spinning the research as they have, the Gates Foundation is clearly distorting the straightforward interpretation of their findings: a mechanistic system of classroom observation provides virtually nothing for its enormous cost and hassle.  Oh, and this is the case when no stakes were attached to the classroom observations.  Once we attach all of this to pay or continued employment, their classroom observation system will only get worse.

I should add that if classroom observations aren’t useful as predictors, they also can’t be used effectively for diagnostic purposes.  An earlier promise of this project was that it would figure out which teacher evaluation rubrics were best and which sub-components of those rubrics were most predictive of effective teaching.  But that clearly hasn’t panned out.  In the new reports I can’t find anything about the diagnostic potential of classroom observations, which is not surprising since those observations are not predictive.

So, rather than having “figured out what makes a good teacher” the Gates Foundation has learned very little in this project about effective teaching practices.  The project was an expensive flop.  Let’s not compound the error by adopting this expensive flop as the basis for centrally imposed, mechanistic teacher evaluation systems nationwide.

(Edited for typos and to add links.  To see a follow-up post, click here.)


Talking ESAs on RedefinED

January 8, 2013

(Guest Post by Matthew Ladner)

Over at RedefinED Ron Matus and I discuss ESAs as a new type of school choice program.


Happy New Year

January 2, 2013

(Guest Post by Matthew Ladner)

Ed Week’s Sean Cavanaugh looks back at the school choice world of 2012 and looks ahead to 2013. Well worth a read.