Why Test Scores are ALMOST Useless to Me

If you've been following the conversation on the Radical recently (see here and here), you'll know that we've been collectively wrestling with the place that the magic bubbles (read: standardized test scores) should have in assessing student understandings. 

One question that I haven't had a chance to answer yet was posed by regular reader K. Borden.  She wrote:

If that rising sixth grader were entering your class, would you want only her former teachers’ observations or would you want those test results as well? Which would serve you best in providing her an opportunity to be an enthusiastic, confident and competent learner?

Good question, K, and one that has had me thinking over the past few weeks as our school's end of grade test scores have come back from Scantron Central.

Here's my response:  When talking about my initial attempts to identify the strengths and weaknesses of the students entering my classroom, test scores are almost completely useless to me.  I have little confidence in them as a measure of an individual child's abilities and—given the choice—would take observations by both parents and teachers in every circumstance.

That's not a very tempered response, is it?! 

But it's a response built from the understanding that test scores for individual children can change dramatically from one administration to the next with no apparent explanation.  I first learned this lesson while watching the testing results of a boy (let's call him Jamison) that I tutored several years ago.

Like many of the students I tutor, Jamison was the prototypical middle school boy.  He was active and funny, but unpredictable on a good day!  When he was on, he was brilliant—engaged in deep and meaningful conversations about world events that would force the thinking of every other child in his class.  When he was off, he'd be throwing his shoes across the room just because he thought it was funny. 

When Jamison took the reading end of grade exam for the first time, his scale score (the primary indicator of a child's progress from one year to the next) dropped by something like 13 points.  He went from a Level 3, which demonstrates grade-level mastery of material, to a low Level 2, which (no surprise to me) represents unpredictable levels of mastery.

Now here in North Carolina, students who score a Level 2 on the end of grade exams are given a retest the following week.  While teachers are able to give students remediation lessons between the first testing session and the second, there really isn't much that anyone can do to improve a child's reading ability in a week. 

So I sat with Jamison and reminded him of how important it was to work from the beginning of the test to the end.  We also reviewed a bit of poetry, considering its status as the genre middle school boys like the least!  Then, I crossed my fingers and hoped for the best!  "Work as hard on the last reading selection as you do on the first," I told Jamison the night before his retest. 

When his results came back, Jamison's scale scores were something like 10 points HIGHER than they'd been the year before!  In the span of one week, he'd seen a swing of over 20 points in his scores.  As a comparison, middle grades students see an average of somewhere between 3 and 7 points growth from year-to-year on reading exams. 

20 points of academic growth in a week with little to no remediation is simply ridiculous.
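For anyone who wants to check my math, here is a minimal sketch of the arithmetic in Python.  The numbers are the approximate ones from Jamison's story ("something like" 13 points down and 10 points up), and the names in the code are made up purely for illustration.

```python
# Illustrative arithmetic only. The post gives approximate figures
# ("something like" 13 and 10 points), so these values are assumptions.
LAST_YEAR = 0            # last year's scale score, used as a reference point
FIRST_ATTEMPT_DROP = 13  # approximate drop on the first attempt
RETEST_GAIN = 10         # approximate gain over last year on the retest
TYPICAL_GROWTH = (3, 7)  # typical year-to-year growth range cited in the post


def one_week_swing(drop: int, gain: int) -> int:
    """Distance between the first attempt and the retest, in scale points."""
    first_attempt = LAST_YEAR - drop
    retest = LAST_YEAR + gain
    return retest - first_attempt


swing = one_week_swing(FIRST_ATTEMPT_DROP, RETEST_GAIN)
low, high = TYPICAL_GROWTH
print(f"One-week swing: {swing} points")
print(f"Years of typical growth: {swing / high:.1f} to {swing / low:.1f}")
```

With those approximate figures the swing works out to 23 points, which is several years' worth of typical growth packed into a single week.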

Think about what kind of consequences that has for me as both a teacher and a tutor.  I'm left to wonder which of Jamison's two scores was "the right score."  Was it his first attempt, which saw him struggle mightily?  If so, I need to seriously look at the instructional strategies that I'm choosing for students like Jamison because something's not working.

Or was it his second score, which saw him outperform his peers?  Because if it was, I need to seriously look at the instructional strategies that I'm choosing for students like Jamison because I've discovered the key to leaving no child behind!

The really sad thing is that Jamison's story isn't unique by any means.  Anecdotally, I see Jamisons in my classrooms and schools every single year.  Rarely do retested students see their scores drop from one session to the next, and it's not unusual to see a child move from "struggling mightily" to "spot on" in no time.

Why does this kind of thing happen? 

My guess:  Jamison worked harder on the second test than he did the first.  For Jamison, fear of failing the grade level (or, put more kindly, determination to do well) was the factor that most influenced his final score, not pure academic ability.

And thankfully, other really, really smart people have seen the same trend in student testing scores.  Take Malcolm Gladwell for example.  In his newest book Outliers, Gladwell reviews the work of Erling Boe, a researcher at the University of Pennsylvania who noticed an interesting trend while studying the TIMSS exam, a math and science test given to samples of fourth and eighth grade students in countries around the world.

Like most standardized exams, the TIMSS test begins with a survey that asks students about topics ranging from their opinions towards math and the amount of time spent on homework outside of class to the highest level of education that their parents have ever reached. 

Unlike most standardized exams, however, the TIMSS survey is nothing short of grueling. 

It consists of something like 120 questions.  And remember, the students taking the TIMSS test are either 9-year-old fourth graders or 13-year-old eighth graders.  (When was the last time that you asked 9-year-olds to fill out a 120-question survey?!)  As Gladwell notes, the TIMSS survey is a test of student determination.  "It is so tedious and demanding," he writes, "that many students leave as many as ten or twenty questions blank."

What Erling Boe discovered next, however, was the really interesting part.  As Gladwell writes:

"As it turns out, the average number of items answered on that questionnaire varies from country to country.  It is possible, in fact, to rank all the participating countries according to how many items their students answer on the questionnaire. 

Now what do you think happens if you compare the questionnaire rankings with the math rankings on the TIMSS? 

They are exactly the same.  In other words, countries whose students are willing to concentrate and sit still long enough and focus on answering every single question in an endless questionnaire are the same countries whose students do the best job of solving math problems."

(Kindle Location 2972-2996)

Amazing, isn't it?  What Boe discovered is something that every elementary and middle school teacher has known for as long as they've stood in front of squirmy tweens:  End of grade reading and math tests really aren't reading and math tests at all. 

Instead, they're tests of a student's resilience, determination, and mental stamina.

The kids who do the best on end of grade exams are like those who do the best on the TIMSS exams:  They're willing to sit still and concentrate.  They take every question seriously—including those that come at the end of a three-hour testing session two weeks before the end of the school year.  

Does that mean that kids like Jamison—who score poorly on end of grade exams—are struggling with grade level content? 

The sad fact is that it's just plain impossible to say.  Jamison could be falling behind academically.  But he might also be the kind of twelve-year-old who struggles to concentrate as he plugs through reading passages on topics that he's not interested in or that he has no first-hand experience with. 

Don't get me wrong:  End of grade tests have their place.  They allow schools to make comparisons across large samples of students to identify trends in populations.  After looking at the results of all of our sixth graders this year, our teachers may discover that our students struggle with expository text structures or with identifying bias, and those are trends that will help us to tailor our instruction for next year.

But I simply can't believe that end of grade tests are reliable indicators of an individual child's academic ability when I see students raise their scores by 20 points in a week.  Such unpredictability calls into question the meaning of every score generated by the magic bubbles, regardless of whether it is higher or lower than expected.

9 thoughts on “Why Test Scores are ALMOST Useless to Me”

  1. David Cohen

    It’s always interesting to read about the perspectives of those of you dealing with earlier grades. Tenth and eleventh graders, in my experience, are commonly resentful or apathetic about testing. There is a small percentage that acknowledge filling in random bubbles, others who blow through giving a half-hearted effort, some who probably mean well but hit the fatigue wall, and some who have the endurance but not the desire to really work at the hard ones, since the tests are meaningless to the students. Many others who do quite well have totally gamed the system on the language arts tests and understand that they do not need to read a long, dense passage in order to answer questions like “In line 4, the most likely meaning of the word ‘horizon’ is…” The variability in student performance, based on factors you mentioned and others, is an issue ignored by too many people.

  2. Kathie Marshall

    Bill, Jamison reminds me of students in my sixth and seventh grade reading intervention course. When comparing the DRP scores that got them into the intervention versus their end of course scores, two students rose three grade levels. Great curriculum? Great teacher? However, six or seven students actually got lower scores! Terrible curriculum? Terrible teacher? Like you, I believe this discrepancy has to do with the students’ ability and/or willingness to keep on the task and persist through the difficult entries all the way to the end. The fastest done were the lowest scorers. However, if outsiders were to look at this data, I wonder about all of the erroneous conclusions that could be drawn.

  3. Cristine Clarke, Ed. D.

    and end of grade tests have become the only relevant standard schools use to assess special education needs. if the child passes EOG’s they can’t have a learning disability…huh? this kind of weight on one measurement misses the point of education entirely

  4. Sam

    I should have said, I don’t need to reassess as much. Of course, good assessments are a necessary tool. I was referring, though, to the type of comprehensive assessments that often take place at the beginning of a school year.

  5. TeachMoore

    Thanks (Bill and K. Borden) for the graphic and specific breakdown of some of the (many) flaws in relying on standardized tests alone to make crucial decisions about children’s education. My experience with state testing data is that it was too generic and too late to use in any meaningful way in the classroom. Some of the newer classroom-level data systems that allow teachers or teams of teachers to track student performance over time, however, are much more useful, but still give only a partial picture, as your post notes. Nothing like the on-the-ground observations of a trained teacher or an attentive parent.

  6. Joel Zehring

    It’s the dark under-belly of the “data-driven” beast. What educators have failed to remember is that data can be both quantitative and qualitative. “How many questions did she answer correctly?” is one question, but it should be followed by “How did she answer these questions correctly?” and “What algorithms and intelligences did she use?”
    I hear very few people ask what standardized tests cannot tell us about students.
    I’ve just seen a report from the Broader Bolder Approach organization that recommends school and state performance assessments that do not rely exclusively on standardized test results. Download it here: http://www.boldapproach.org/report_20090625.html

  7. K. Borden

    Sam said: “I suppose this is an argument for “looping” with students, because when I do that, I don’t need to reassess. I save a lot of time and can move ahead faster.”
    I suppose you could say we loop now with our daughter, given that we are the ones teaching her consistently now. But, after our journey, we have learned to value multiple and diverse assessments, so we don’t rely on just those we do.

  8. K. Borden

    Mr. Ferriter said “I have little confidence in (EOGs) them as a measure of an individual child’s abilities and—given the choice—would take observations by both parents and teachers in every circumstance.
    That’s not a very tempered response, is it?!”
    Let me throw you a curveball that actually gives more evidence for the dangers in relying on EOGs. My daughter was not placed under an IEP because her Level 4 performance on the EOGs did not show her disability impacted her performance. Yup, a child with loads of data not typically available (occupational therapists reports, psych-ed evaluations…) documenting the existence of a disability could not qualify for an IEP because…she performed so darned well on the EOGs that her “disability was not depriving her of a free and appropriate” education. Laughing or crying yet?
    That said, it was still the pesky EOG’s that provided a contrast to the observations of classroom teachers that prompted us to ask more questions, seek more information and ultimately find the answers that have allowed a very able child to develop the skills to achieve. Teachers wrote her off as failing to complete assignments and failing to use time wisely for years. EOGs noted her as exceeding grade level expectations. Fortunately, we could afford to take that contrast and try to determine what the dickens was going on.
    Teachers were right about her failing to complete assignments but wrong about why. EOG’s were right about what she was learning (to the extent the ceilings measure it) but couldn’t detect the co-existing disability. Simon asked me in the previous thread what teachers missed. Stated simply, they missed that she was holding the pencil in a way that made writing very slow, very labor intensive and very inefficient. So yeah, she could fill in ovals to correctly respond to questions very well, but she couldn’t complete assignments. Was she using her time wisely? Were any of us helping her much when we insisted that if she just tried with a more diligent attitude she could meet expectations?
    We did help her when we noted a discrepancy the EOGs played a part in helping to demonstrate and went forward to seek more testing and evaluation. We opened a whole new world to her, began to teach her to type and she has taken it from there. We were able to do that because we could afford to engage the diagnostics. Along the journey we also found that she was pre-diabetic and had allergies we were not aware existed. Challenges that can be addressed if you just know they exist.
    The EOGs are not useless, but using them inappropriately is dangerous business. When she was evaluated further, she was indeed exceeding grade level in ability.
    Your observations of Jamison were “unpredictable on a good day”, “when he was on, he was brilliant” and “when he was off, he’d be throwing his shoes across the room”. Then his test results on EOG’s reflected what could be described as an “unpredictable” swing between being on and off over a two week period. It could be said that both the Scantron scoring machine and your observations reported consistent results for inconsistently performing Jamison.
    You said: “20 points of academic growth in a week with little to no remediation is simply ridiculous.”
    However, you earlier noted that “When his results came back, Jamison’s scale scores were something like 10 points HIGHER than they’d been the year before!” (and noted an average of a 3 to 7 point increase year to year for middle grades generally.) If I am reading this correctly, if you compare his retest result to his previous year result, it demonstrates about 3 points above average growth year to year.
    The picture emerging of Jamison from both your observations and the tests is one of a student performing inconsistently at times, with a tendency toward demonstrating competence.
    Wouldn’t it be great to help Jamison find the tools, skills and habits that would allow him to demonstrate the abilities that your observations and his performance over time on the tests indicate he has? Is it possible that his 3 point over average year over year growth reflects that you did? The really tough part is determining what it is that is causing the inconsistency he demonstrates at times in class and on the tests.
    I wish all parents could afford to do what we did and recognize the need to do it. It can change a child’s life.

  9. Sam

    Hi Bill,
    I agree completely. I’d also add to your answer that the most useful information to me, as a classroom teacher, has always been my own assessments, including observations and conversations. My assessments are most useful because I fully understand the way in which the student was prepared for the assessment. I know what was taught ahead of time. I have the context.
    I suppose this is an argument for “looping” with students, because when I do that, I don’t need to reassess. I save a lot of time and can move ahead faster.
    But in terms of new students, I would imagine that any teacher who values assessment data finds their own assessments more valuable than both test scores and the assessment results of previous teachers.
    And those teachers who truly value test results can always just give a practice test on their own time. There is no need to rely on the results of a test given up to 8 months beforehand.
