The Dark Days of Educational Measurement in the Sunshine State Ended in 1999

(Guest Post by Matthew Ladner)

Over on the Shanker Blog of the American Federation of Teachers, Matthew DiCarlo writes a thoughtful but ultimately misguided post, “A Dark Day for Education Measurement in the Sunshine State.”

DiCarlo is obviously very bright, but a few critical misinterpretations have led him astray. DiCarlo demonstrates that family income is highly correlated with student test scores in Florida. No surprise: the same is true everywhere.

Having demonstrated this, DiCarlo develops a critique of Florida’s school grading system. The Florida school grading system carefully balances overall performance on state exams with academic growth over time. Specifically, the formula bases 50% of a school’s grade on student proficiency on state exams, 25% on the growth of all students, and the final 25% on the growth of students who scored in the bottom quartile on last year’s exam.

The last bit is the clever part of the formula. By double weighting the gains of students who are behind, the formula makes them the most important children in the building. Only students in the bottom quartile on last year’s test count in all three categories.
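To make the weighting concrete, here is a minimal sketch of the 50/25/25 split described above. Only the three weights come from this post; the function name, the use of simple rates as inputs, and the example school are assumptions for illustration, and the state's actual point scale and letter-grade cutoffs are not modeled.

```python
# Weights as stated in the post; everything else here is illustrative.
WEIGHTS = {
    "proficiency": 0.50,            # share of students proficient on state exams
    "gains_all": 0.25,              # share of all students making gains
    "gains_bottom_quartile": 0.25,  # share of last year's bottom quartile making gains
}


def composite_score(proficiency, gains_all, gains_bottom_quartile):
    """Return the weighted average of three rates, each a fraction in [0, 1]."""
    return (
        WEIGHTS["proficiency"] * proficiency
        + WEIGHTS["gains_all"] * gains_all
        + WEIGHTS["gains_bottom_quartile"] * gains_bottom_quartile
    )


# Hypothetical school: 60% proficient, 65% of all students making gains,
# 70% of last year's bottom quartile making gains.
score = composite_score(0.60, 0.65, 0.70)
print(f"composite: {score:.4f}")
```

Because a bottom-quartile student is also counted among all students, that student's gain feeds two of the three components, which is the double weighting the formula is built around.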

DiCarlo goes into the devilish details about how the state determines these gains, and concludes that some of the gains measures don’t actually measure academic growth but effectively measure academic proficiency. The use of proficiency levels in determining gains is critical because students are taking a higher grade level assessment with more rigorous content. If a student achieves a proficient score on the eighth grade FCAT and then again on the ninth grade FCAT, the student is performing at a higher level because the content is more difficult. Florida’s system does not provide credit for a learning gain for students performing Advanced in one year but Proficient the next year.

DiCarlo has failed to appreciate that the mastery of more challenging academic material from one grade to the next itself constitutes a form of academic growth.

The 9th grade student has now studied the mathematics curriculum of both 8th and 9th grade and has demonstrated proficiency of the 8th grade material and proficiency of the 9th grade material. Given the valid system of testing, we can feel assured that the 9th grader knows more about math than he or she knew as an 8th grader. The growth in this case is staying on track in a progressively more challenging sequential curriculum.

The Florida system, in essence, uses proficiency levels to define which gains and drops count as meaningful. There is of course no “correct” way to structure such a system, and if 100 different people examined any given system they would likely have 500 different suggestions for improvement to match their preferences.

DiCarlo’s notion of “fairness” seems to have distracted him from a far larger and more important issue: the Florida grading system, whose utility is seen best at the school grading level, has improved student achievement for all students.

If you go back as far as the FCAT data system will take you for results by Free and Reduced lunch eligibility for 3rd grade reading, you’ll find that in 2002 48% of Florida’s free and reduced lunch students scored FCAT 3 or better. In the most recent data available from 2010, 64% scored FCAT 3 or better. That is an enormous improvement in the percentage of students scoring at grade level or better.

In 2002, 60% of all Florida students scored Level 3 or above, and in 2010, 72% scored Level 3 or above. Free and reduced lunch eligible kids in 2010 outperformed ALL kids in 2002 by 4 percentage points. That’s real progress. And the free and reduced lunch eligible children overtake the 2002 general population averages in a large majority of grades tested.

The same pattern can be found in Florida’s NAEP data. For instance, in 1998, 48% of Florida’s free and reduced lunch eligible students scored “Below Basic” on the NAEP 8th grade reading test. In 2011, that number had fallen to 35%. If an “unfair” system helps to produce a 27% decline in the illiteracy rate among low-income students, I’d like to order up a grave injustice.
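The figures above mix two kinds of change: percentage-point differences (64% versus 60%) and a relative decline (the 27%). A short sketch of the arithmetic, using only the numbers quoted in this post, shows how each is derived. The helper names are assumptions for illustration.

```python
def point_change(before, after):
    """Difference between two rates, in percentage points."""
    return after - before


def relative_change(before, after):
    """Change between two rates, as a fraction of the starting rate."""
    return (after - before) / before


# FCAT 3rd grade reading, Level 3 or above (figures from the post).
all_students_2002 = 60.0
low_income_2010 = 64.0
print(point_change(all_students_2002, low_income_2010), "points")

# NAEP 8th grade reading, low-income students scoring Below Basic.
below_basic_1998 = 48.0
below_basic_2011 = 35.0
print(point_change(below_basic_1998, below_basic_2011), "points")
print(f"{relative_change(below_basic_1998, below_basic_2011):.1%} relative")
```

The Below Basic rate fell by 13 percentage points, and 13 out of a starting 48 is the roughly 27% relative decline cited above.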

The “Dark Days of Education Measurement in Florida” in my view were before school grades. Academic failure lay concealed behind a fog of fuzzy labels, and Florida wallowed near the bottom of the NAEP exams. Back when there was little transparency and even less accountability, far more students failed to acquire the basic academic skills needed to succeed in life. While perhaps a lost golden age for educators and administrators wishing to avoid any responsibility for academic outcomes, it was a Dark Age for students, parents and taxpayers.

Ironically, DiCarlo has decried a system which has weakened the link between family income and academic outcomes demonstrated in his post. Yes, it is still strong in Florida, but it used to be much, much stronger.

Finally, can one truly complain about the “fairness” of a system providing more than ten times as many A/B grades as D/F grades? If anything, the Florida school grading system has grown too soft in my view (see chart above).

I’ve read enough of DiCarlo’s work to know that he is a thoughtful person, so I hope he will examine the evidence for himself and reconsider his stance. I don’t have any reason to think that the Florida system is perfect. I don’t think a perfect system exists, and I suspect that there are some changes to the Florida system that DiCarlo and I might actually agree on.

It seems difficult, however, to argue that the Florida system hasn’t been useful if one gives appropriate weight to the interests of students, parents and taxpayers to balance those of school staff.

This entry was posted on Wednesday, February 8th, 2012 at 10:04 pm and is filed under school accountability.

4 Responses to The Dark Days of Educational Measurement in the Sunshine State Ended in 1999

Matt Di Carlo says:

February 9, 2012 at 11:43 am

Hi Matthew,

Thanks for your post, and kind words. Five points:

First, just to clarify, Shanker Blog is published by the Albert Shanker Institute, not AFT. The organizations are related, but distinct. Information on the Institute and a list of its Board of Directors can be found here: http://www.shankerinstitute.org/about/

Second, notes of agreement: I fully acknowledge that these state systems will never be perfect (and we wouldn’t know it even if they were). There will always be aspects to quibble with, and I say as much in the post. And, while I personally think Florida’s system is the worst I’ve seen, there is plenty of room for reasonable disagreements between reasonable people, and I appreciate your taking the time to engage thoughtfully. We are also in accord that the partial focus on the lowest-performing students is a good thing (which I also note in the post), as is the fact that Florida limits its calculations to students who are in the sample in both years. Finally, I am not at all against rating schools and districts – quite the contrary, like yourself, I support the idea in theory, which is precisely why I take the time to critique the systems. If I thought they were inherently useless or bad, I wouldn’t bother.

Third, my objections to the Florida system have much less to do with “fairness” than with ineffective policy, though, as is often the case, the former is a reflection of the latter. My examination of the bias in the final ratings by student characteristics is not an attempt to argue that poorer districts will be treated unfairly (though that’s true), but rather that districts will be rewarded and punished based on ratings that have little or nothing to do with their actual performance – i.e., ability to increase student performance. That’s bad policy, and that’s my concern.

Fourth, and most substantively, I’m afraid your definition of “growth” doesn’t make much sense to me. Virtually all students “know more about math” after completing a grade, relative to how much they knew at the end of the previous year. The whole idea of growth as typically understood is testing gains above and beyond some benchmark.

Furthermore, even if I accepted your alternative definition of “growth,” it would still conflict with Florida’s system. In other words, even if, as you argue, students who are proficient in math in two successive years have demonstrated “growth,” you would still need to explain why, according to Florida’s system, a student who is “basic” or “below basic” in two successive years hasn’t done the same. Putting aside the horrible issues with cutpoint-based measures, that student has also made roughly a year’s worth of progress, just from a lower starting point, and he or she certainly “knows more about math” than the year before. Why isn’t this “growth” too?

Now, if Florida considered it “growth,” that would probably eliminate the bias, but only because virtually all students would be coded as “making gains” – i.e., the vast majority would either remain in the same category or go up between years. Put bluntly, if you define “growth” in a manner that doesn’t actually require an increase in performance, virtually everyone will grow.

In contrast, what Florida does is apply that standard – growth without the need for growth as conventionally understood – to some students and not others. Students who aren’t proficient in year one actually have to demonstrate progress (move up a category or increase their raw score by more than a year’s worth of learning), whereas students who are proficient (or above) don’t – they can just stay constant (or even decrease in their score, so long as they aren’t bumped down).
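The asymmetry described in this comment can be sketched in a few lines. This is only an illustration of the rule as the comment states it, not the state's actual procedure: the category labels, the `score_gain` and `one_year_gain` parameters, and their units are hypothetical.

```python
# Ordered achievement categories; labels follow the wording used in this
# thread and are placeholders, not the FCAT's own level names.
LEVELS = ["below_basic", "basic", "proficient", "advanced"]
PROFICIENT = LEVELS.index("proficient")


def made_gain(level_before, level_after, score_gain=0.0, one_year_gain=1.0):
    """Return True if the student is coded as "making gains".

    score_gain and one_year_gain are hypothetical: the student's raw score
    increase and the increase treated as one year's worth of learning.
    """
    before = LEVELS.index(level_before)
    after = LEVELS.index(level_after)
    if after > before:
        return True   # moved up a category
    if after < before:
        return False  # bumped down a category
    if before >= PROFICIENT:
        return True   # proficient or above: holding steady is enough
    # Below proficient and in the same category: the score must rise by
    # more than a year's worth of learning.
    return score_gain > one_year_gain


print(made_gain("proficient", "proficient"))            # holds steady
print(made_gain("basic", "basic"))                      # holds steady
print(made_gain("basic", "basic", score_gain=1.5))      # exceeds a year's gain
print(made_gain("advanced", "proficient"))              # bumped down
```

Under this sketch, two students with identical year-to-year trajectories are coded differently depending only on their starting category, which is the point at issue in the comment.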

As a result, the vast majority of the students proficient in year one will be coded as “making gains,” regardless of whether or not they actually made gains (as I define them). As I show in the post, there is relatively little variation between the two measures – much of the “growth” measure (for all students) is conflated with absolute performance. That means the distribution of proficient students across districts, rather than the actual quality of instruction, will largely drive the percent of students “making gains.” They are both mostly measures of absolute performance level.

If you want to argue that proficient students *should* be held to a different standard because they’re already where they need to be, you can make that case, but it is not consistent with the purpose of a system rating *school* performance. It’s growth conditional on performance level, which, again, means that the ratings will measure student performance, not schools’. That might be useful to parents (e.g., peer effects), but it’s terrible for the purpose of policy decisions.

This shows up very clearly in the results: Since three-quarters of districts’ grades (the 25% for all students making gains and the 50% for absolute rates) are based largely on absolute performance, virtually every high-income district received an “A” and virtually every low-income district received a poor grade. To the degree that these grades are used in any actual decisions, they will be rewarding and punishing schools based mostly on their students’ characteristics, not on how well they boost the performance of those students. Again, that’s bad policy.

Fifth and finally, you argue that Florida’s school/district grade system is responsible for the trend in state and NAEP scores that you illustrate. I make a habit of not putting forth causal arguments about trends in testing results when causality is not tested, especially when the data are cross-sectional. I’m wondering why you rely on purely descriptive data, rather than the couple of multivariate analyses that address this issue. For example, the study by Hanushek/Raymond seems most relevant here, since they separate the “effect” of reporting results from that of using them. They find that attaching consequences to schools’ results may explain a small part of states’ raw NAEP trends (such as the Florida results you present). But they also show that simply reporting results publicly (e.g., a district grading system) has no effect.

Although I’m always more than a little cautious when it comes to this sort of state-level analysis, these results do not quite support your argument. It’s not the grading systems per se that matter, it’s how you use them. And, even if it was the grades themselves driving the change (after all, you need to measure performance in order to hold schools accountable for it), that would only mean that the benefits would be greater if the measures were more rigorous.

Florida’s raw performance results are almost certainly a combination of a whole bunch of interconnected factors, school and non-school (including, perhaps, changes in student demographics). If you are speculating that a system assigning grades to schools and districts in Florida is the primary cause of these trends, then that’s up to you, but I do not find that argument at all compelling.

So, overall, I’m afraid I must remain astray in this case – though, as you seem to hint, I suspect (but am not certain) that we largely agree on the shortcomings of many of this particular system’s details. That the system is bad doesn’t mean Florida or accountability in general is bad.

Thanks again for your response, and I’m sorry this is longer than your post (and, most likely, my original piece as well).

MD

matthewladner says:

February 9, 2012 at 3:17 pm

Matt-

Thanks for your response, and duly noted concerning the Shanker Institute. You raise more points than I will be immediately able to address, but I will briefly address the issue of causality. Florida did indeed undertake a number of different reforms at the same time starting in 1999. While we do have a number of individual empirical evaluations concerning particular policies, this doesn’t allow us to draw any firm conclusions at the aggregate level concerning how much improvement can be ascribed to any particular policy. Florida’s reforms didn’t unfold in a random assignment study but rather in the real world. Campbell and Stanley would rightly note that we have no ability to “control for history” in forming our opinions, so we are left to gather as much evidence as possible to inform our decisions until more evidence becomes available.

The firmly held opinion of the people who fought to put this system in place is that the grading system served as a crucial lynchpin in the improvement of results. They could be wrong about that, but I haven’t seen anything to make me doubt it. A number of other possible explanations, including demographic change and spending, however, are easily dismissed by gathering some basic data:

J.K. Rowling: The Jeb Bush of the NEPC Florida Fantasy

I can reasonably predict that your inner statistician abhors such a messy reality. Mine does too, but it is what it is.

On growth, if a kid scores below basic in one year and below basic the next year, it seems entirely possible that the kid didn’t know anything about math in either year. It would be rather perverse to make it possible to literally score 0% correct in one year and 0% the next year and still get credit for growth.

On the other hand, if he or she scores proficient in one year and proficient in the next, it seems credible to me that the student has gained knowledge of math in a progressively more challenging curriculum. On your point about basic to basic, I’m not a testing expert and I don’t even play one on television, but the possibility of a larger compounding of gaps in knowledge seems likely, at least in comparison to the proficient to proficient example.

I’m not sure if we ought to expect students to jump from the proficient to advanced levels as easily as from basic to proficient per se, but I wouldn’t be shocked if there was a good reason not to expect it. Ceiling effects would seem much more likely to come into play in the thin air of advanced achievement.

All of these things are judgement calls, and I suspect that we might agree that the system needs adjustment at the higher end of things. Given the large improvements in NAEP scores for disadvantaged Florida students, however, if Florida has “the worst” system, I’m eager to see the best.

Retention idea works, advocates tell panel « says:

February 14, 2012 at 3:27 pm

[…] The Dark Days of Educational Measurement in the Sunshine State Ended in 1999 (jaypgreene.com) […]

Special Education – Empowers Special Students » College Distance Education says:

February 16, 2012 at 7:46 pm

[…] Jay P. Greene's Blog […]


Jay P. Greene's Blog