The weak predictive power of test scores

Here’s my first round in the debate with Mike Petrilli over whether test scores are reliable indicators of quality that can be used by regulators and policymakers to identify schools to be closed or expanded…

-----------------------------

The school choice tent is much bigger than it used to be. Politicians and policy wonks across the ideological spectrum have embraced the principle that parents should get to choose their children’s schools and local districts should not have a monopoly on school supply.

But within this big tent there are big arguments about the best way to promote school quality. Some want all schools to take the same tough tests and all low-performing schools (those that fail to show individual student growth over time) to be shut down (or, in a voucher system, to be kicked out of the program). Others want to let the market work to promote quality and resist policies that amount to second-guessing parents.

In the following debate, Jay Greene of the University of Arkansas’s Department of Education Reform and Mike Petrilli of the Thomas B. Fordham Institute explore areas of agreement and disagreement around this issue of school choice and school quality. In particular, they address the question: Are math and reading test results strong enough indicators of school quality that regulators can rely on them to determine which schools should be closed and which should be expanded—even if parental demand is inconsistent with test results?

To a very large degree, education reform initiatives hinge on the belief that short-term changes in reading and math achievement test results are strong predictors of long-term success for students. We use reading and math test scores to judge the quality of teachers, schools, and the full array of pedagogical, curricular, and policy interventions. Math and reading test scores are the yardstick by which education reform is measured. But how good a yardstick is it?

Despite the centrality of test scores, there is surprisingly little rigorous research linking them to the long-term outcomes we actually care about. The study by researchers from Harvard and Columbia (Chetty et al.) showing that teachers who increase test scores improve the later-life earnings of their students is a notable exception to the dearth of evidence on this key assumption of most reform initiatives. But that is one study, it has received some methodological criticism (although I think that has been addressed to most people’s satisfaction), and its results from low-stakes testing may not apply to the high-stakes purposes for which we would now like to use test scores. This seems like a very thin reed on which to rest the entire education reform movement.

In addition, we have a growing body of rigorous research showing a disconnect between improving test scores and improving later-life outcomes. I’ve written about this at greater length elsewhere (see here and here), but we have eight rigorous studies of school choice programs in which the long-term outcomes of those policies do not align with their short-term achievement test results. In four studies, charter school programs that produce impressive test score gains appear to yield little or no improvement in educational attainment. In three studies of private school choice programs and one study of a charter school program, we observe large benefits in educational attainment and even earnings but little or no gains in short-term test score measures.

If policy analysts and the portfolio managers, regulators, and other policymakers they advise were to rely primarily on test scores when deciding which programs or schools to shutter and which to expand, they would make some horrible mistakes. Even if we ignore the fact that most portfolio managers, regulators, and other policymakers rely on the level of test scores (rather than gains) to gauge quality, math and reading achievement results are not particularly reliable indicators of whether teachers, schools, and programs are improving later-life outcomes for students.

What explains this disconnect between math and reading test score gains and later-life outcomes?

First, achievement tests are designed to capture only a portion of what our education system hopes to accomplish. In particular, they are not designed to measure character or non-cognitive skills. A growing body of research is demonstrating that character skills like conscientiousness, perseverance, and grit are important predictors of later-life success (see this, for example). And more recent research by Matt Kraft, Kirabo Jackson, and Albert Cheng and Gema Zamarro (among others) shows that the teachers, schools, and programs that increase character skills are not necessarily the same as those that increase achievement test results. There are important dimensions of teacher, school, and program quality that are not captured by achievement test results.

Second, math and reading achievement tests are not designed to capture what we expect students to learn in other subjects, such as science, history, and art. Prioritizing math and reading at the expense of other subjects that may be important for students’ later-life success would undermine the predictive power of those math and reading results.

Third, many schools are developing strategies for goosing math and reading test scores in ways that may not contribute to (and may even undermine) later-life success. The fact that math and reading achievement results are overly narrow and easily distorted makes them particularly poor indicators of quality and weak predictors of later-life outcomes.

I do not mean to suggest that math and reading test results provide us with no information or that we should do away with them. I’m simply arguing that these tests are much less reliable indicators of quality than most policy analysts, regulators, and policymakers imagine. We should be considerably more humble about claiming to know which teachers, schools, and programs are good or bad based on an examination of their test scores. If waiting lists tell us that parents think certain teachers, schools, and programs are good, we should be very cautious about declaring that the parents are mistaken based on an examination of test scores. Even poorly educated parents may have much more information about quality than analysts and regulators sitting in their offices looking at spreadsheets of test scores.

I also do not mean to suggest that policymakers should never close a school or shutter a program in the face of parental demand. I’m just arguing that it should take a lot more than “bad” test scores to do that. Yes, parents can and will make mistakes. But analysts, authorizers, regulators, and other policymakers also make mistakes, especially if they rely predominantly on test results that are, at best, weak predictors of later-life success. The bar should be high before we are convinced that the parents are mistaken rather than the regulators poorly guided by test scores. Besides, we should prefer letting parents make mistakes for their own children over distant bureaucrats making mistakes for hundreds or thousands of children while claiming to protect them.

(Also posted at Flypaper)

6 Responses to The weak predictive power of test scores

  1. Greg Forster says:

    People need to understand that a single study by itself establishes nothing, except perhaps an agenda for future study. Science is an iterative process. Such “proof” as we are able to produce comes from multiple studies testing and retesting theories, and (at least as important) scientists debating the interpretation of the results.

I used to cite Chetty as establishing the value of test scores. I wish I’d been more circumspect.

  2. Frederick ROBERTS says:

Is it impossible to determine with a written test whether a student has learned to read, to write a coherent sentence, or to use mathematical reasoning?

If so, then how are such competences to be verified or shown to be lacking?

And how can countries like Australia, Canada, Singapore, Germany, England, France, and Russia determine whether a student has gained or failed to gain such skills?

    • Greg Forster says:

      Certainly those three competencies, in individual students, can be measured with tests (although the reliability of writing tests remains a more open question than that of reading and math tests). The questions Jay is raising are whether it’s sufficient to judge schools so overwhelmingly on these competencies alone, and whether the political and bureaucratic structures needed for such judgment will be undermined by various perverse incentives.

  3. Frederick ROBERTS says:

Reading by year-end would be a 1st Grade competency for most. Few at that age would be equal to ‘War and Peace’, but they should be able to read engaging stories with one or more points. Saying what those points were might be a stretch. Beginning readers tend to like stories that amuse or intrigue them more than anything else.

Being able to write coherent sentences should emerge by the end of 3rd Grade and certainly by the end of 4th. These would be simple and compound sentences, which should not be dictated or learned through drilling for test day.

By the end of 4th Grade, a child should know how to add and subtract, work with fractions and decimals, know the times tables, understand percentages, ratios, and proportions, have a basic grasp of powers and roots, and know basic number properties.

Do these competences ask too much of basic, compulsory, mass education?

Any child lacking them would be behind the eight ball, though he might be ahead of or right in the middle of his peers. Whether a child is behind the pack is good information to have, and it should be uncovered much sooner than the end of Grade 4.

If the child is found to be behind the pack EARLY in Grade 1, what then? Send in back-up. If the whole class is behind, find out whether the teacher is not properly trained, whether the class was not yet school-ready or was an unrepresentative sample of mostly below-average students, or whether non-school factors are holding the students back. Waiting until the end of Grade 1 is indefensible, and waiting until the end of Grade 4 should be punishable. The costs to society are enormous.

Without individual oral tests given by independent testers, or written ones taken by whole classrooms, how does one know whether a child has learned these things? Surely not with a wetted finger raised to the wind. Compulsory, mass public education cannot give individual oral tests, and faking those results would tempt most teachers too much. So written ones have to do. Penalties against teachers and administrators who collude in faking results should have real teeth.

From Grade 7 through Grade 12, I attended a Classical school, where we took five full-time subjects a year. Languages received great stress, but we also took English, History, Geography, and Math.

We took written tests in each subject weekly. Those tests were the basis of our monthly report cards, which parents had to sign, whether they liked our marks or not.

This was a public school, and its graduates all went on to college, real ones too!

