(Guest Post by Douglas J. McRae and Williamson M. Evers)
Did Smarter Balanced mishandle the bank of test-questions for 2017? Test scores dropped in virtually all states using Smarter Balanced national tests in their statewide testing programs in 2017.
Among the 14 states that used the Common Core-aligned Smarter Balanced tests in spring 2017, English/Language Arts and Mathematics composite scores declined in 11, showed neither a loss nor a gain in 2, and rose very modestly in only one. Looking only at E/LA scores, there are declines for 13 states and a gain of only a tiny fraction of one percent for one state.
Tony Alpert, executive director for the Smarter Balanced Assessment Consortium, argues these results show that scores from its states are on a plateau. No, instead, there has been a substantial consortium-wide decline in scores.
We can compare the 2017 declines to consortium-wide composite score gains for 13 out of 14 Smarter Balanced states in 2016, and to composite gains for the parallel PARCC consortium for 2017 for all except one state. Both of these comparisons make the 2017 Smarter Balanced declines look like a sore thumb pointed downwards.
Yet Smarter Balanced continues to stonewall against releasing actual evidence or independent analysis of data contributing to the 2017 declines in test scores.
Alpert’s Jan. 26 opinion piece acknowledged for the first time that the 2017 item bank was changed from the 2016 item bank in a significant way. All information released by Smarter Balanced prior to Jan. 26 indicated that the 2016 item bank was unchanged from 2015, and there was no public notice that the 2017 item bank was in fact modified from the 2016 item bank. This lack of transparency from Smarter Balanced adds to concerns that the 2017 declines may be traced to changes to the 2017 test-question bank.
Alpert says that the item bank was “similar between the two years.” Well, “similar” isn’t good enough for valid, reliable gain-scores from year-to-year or trend data over multiple years — certainly one of the major goals for any K-12 large-scale statewide testing program. We need more evidence than an assertion of similarity.
To generate valid, reliable gain scores from year to year, a test maker has to document that changes made to an item bank alter neither the alignment to the academic content standards being measured nor the coverage of what is in the blueprints for the test. In addition, the balance of easy, medium, and hard test questions has to match the prior item bank, or scoring adjustments need to be made to reflect the changes. This information has to be available before a modified item bank can be used for actual test administrations.
A glimpse into this information surfaced on Feb. 9 in a document linked to an Education Week post on this issue. The link led to a technical report from Smarter Balanced subcontractor AIR, dated Oct. 2016 (but not made public by Smarter Balanced until recently). It includes an appendix on changes in item characteristics for the Smarter Balanced operational item banks for 2016 and 2017; its charts show the addition of more “easy” items for E/LA and Math for grades 3 and 4, and the addition of more “hard” items for E/LA and Math for grades 5, 6, 7, and 8. This mix of additional items for the 2017 testing cycle indicates the 2017 item bank had more difficult items than the 2016 item bank, which, unless Smarter Balanced adjusted its scoring specifications for the 2017 test, would be consistent with the decline in scores from 2016 to 2017 documented in late September 2017. To date, Smarter Balanced has not released information on whether the scoring specifications were adjusted for differences in test difficulty on a grade-by-grade basis for both E/LA and Math for the spring 2017 test administration cycle.
In addition, a test maker should monitor the item-by-item data for a revised item bank early in an actual test administration cycle to ensure that the new items added to the bank (whether replacing former items or expanding the size of the bank) are performing as anticipated, and to further modify scoring rules as needed to keep results comparable from year to year. According to the Feb. 9 Education Week post, Smarter Balanced said it was only now doing these analyses, long after the fact.
Smarter Balanced’s lack of transparency about critical information on this issue is quite troubling. So far, Smarter Balanced has released no information confirming that these routine test integrity activities were done prior to scoring and releasing spring 2017 test results for 5-6 million students from 14 states. Smarter Balanced has the behind-closed-doors data from the 2017 testing cycle; those data are required for informing any changes for the 2018 testing cycle, which is already underway. If the 2017 data do not inform changes for the 2018 testing cycle, the lack of routine comparability activities will affect Smarter Balanced annual gain scores and trend data for years into the future.
If Smarter Balanced has the evidence outlined above to justify their claim “we have every reason to believe that the scores accurately describe what students knew and were able to do in spring 2017,” then the professional thing to do was to make that information available concurrent with the release of 2017 test results in the fall of 2017, with perhaps more comprehensive analyses released before the beginning of the next test administration cycle.
Such transparency would have informed students, teachers, parents, school & state administrators, state policy makers, the media, and the public of the integrity of previously released test results and would have offered justification for any changes for the upcoming testing cycle. More transparency from Smarter Balanced would also allow independent experts to review and validate their currently behind-closed-doors data. What is Smarter Balanced hiding? Why isn’t it transparent like any other professional testing organization?
We repeat our call for Smarter Balanced to open the wall of secrecy around the information needed to investigate its 2017 consortium-wide score decline problem, and to allow its claims to be examined by independent experts. The Smarter Balanced January 29 opinion piece falls woefully short of providing evidence that its 2017 tests produced scores comparable from year to year, upon which gain and trend analyses can be conducted. We deserve better analysis and explanations from Smarter Balanced, along with much greater transparency for all parties interested in statewide K-12 test results across the country.
Douglas J. McRae is a retired educational-measurement specialist from Monterey, California. Williamson M. Evers is a research fellow at Stanford University’s Hoover Institution and a former U.S. Assistant Secretary of Education for Planning, Evaluation and Policy Development.
We can compare the SB trends with NAEP after April 10.
Not astonishing that a centralized command and control system would become this flagrantly dishonest and corrupt. A little astonishing it happened so fast. My guess is the national failure of CC has removed incentives to be careful to maintain honesty for the sake of reputation and not activating opposition. Would have happened inevitably anyway, but now there’s no point delaying the gratifications of dishonesty in order to reap benefits of honesty in a future that CC doesn’t have.
PS it’s never the crime, it’s always the cover-up.
I consider standardized testing a hoax. It does not measure academic growth. And the data does not help the teacher, help the student. Standardized testing is a Polaroid snapshot of cohorts during a given school year. What gives a false semblance of measurement is actually the comparison and contrast between different cohorts when they were in the (for example) same grade. And every time the assessment is changed, it degrades the ability to make an honest comparison and contrast.