Reaction to Emily Oster's "why we need test scores"


Programming Note: I’m doing an intensive AI training next week, so there will be no post on 4/4. I’ll be back again the following week.

Emily Oster recently published an article (post? blog?) on her website, ParentData, titled “Why We Still Need Test Score Data,” and I wanted to react to it here.

Before that, though, as a brief aside, Emily Oster’s writing is great. If you’re a parent and you don’t subscribe to ParentData, stop reading right now and go subscribe. It’s free (well, there’s a paid plan, too, but it’s optional), and Emily does a great job balancing empirically-based research with the practical day-to-day concerns of parenting. Her book, Cribsheet, is also worth a read if you have young kiddos.

Anyway, to the article at hand. In it, Oster touches on three things. She:

  1. Reviews patterns of school closures during COVID;
  2. Describes what happened to test scores during COVID; and
  3. Describes post-COVID changes in test scores, including speculating on where we go from here.

The article itself is fairly short, and Oster writes well, so I’d recommend reading it. I won’t regurgitate all the details here, but the gist is basically that lots of school districts closed during COVID, and there was quite a bit of variation in when they returned to in-person teaching and learning. Not surprisingly, test scores declined during COVID, and the declines tended to be associated with remote learning (i.e. the longer kids stayed fully remote, the larger the test score declines). 

This all sets up an analysis that shows the degree to which states have “recovered” pandemic-related learning loss since the return to fully in-person instruction in fall 2021. The graphs displaying these analyses are clear and worth looking at directly. To summarize, though, most states seem to be demonstrating English/Language Arts (ELA) proficiency rates near their pre-pandemic rates, with some states actually surpassing previous proficiency rates. The math picture is a bit bleaker, with most states still at least ~2% below where they were previously, although there’s quite a bit of variability here, too.

Oster’s two takeaways are:

  1. “The variation in recovery across states should be an opportunity for learning.” and
  2. “We need to continue to collect and analyze data like this, since understanding these patterns is key to finding policies that work.”

I mostly agree with her, but let’s dig into these takeaways, starting with the first point.

Yes, educators at all levels – states, school districts, schools, and classrooms – should look to “better performing” peers and try to learn from them. At the classroom and school level, this is kinda what professional learning communities (PLCs) are for (they do more than this, too). And this obviously isn’t unique to education, either. There's tons of content – both academic and pop-culture – about learning from experts. On a darker note, the whole fitness influencer phenomenon is based on this same idea – that normal people can (and want to) adopt the practices of incredibly fit people – which mostly leads to scams and snake-oil peddling (although there are obviously trustworthy influencers, too).

The issue here is that “learning from other states” may not be as straightforward as we hope. This doesn’t mean we shouldn’t try, but it does mean we should approach the whole endeavor with a healthy dose of skepticism. In her article, Oster explicitly asks, “What is Mississippi doing that Massachusetts is not?” This refers to her findings indicating that Mississippi was one of the best-performing states (in terms of recovery), whereas Massachusetts was one of the worst (again, in terms of recovery). So maybe Massachusetts could learn something from Mississippi?

Well, maybe. Much has been made of the “Mississippi Miracle,” and several states, including Virginia, have adopted laws that mirror Mississippi’s Literacy Based Promotion Act (LBPA). There’s probably something useful in adopting “science-based reading” strategies (as opposed to, like, vibes-based strategies, I suppose) and providing literacy coaches to support schools. On the other hand, Mississippi’s LBPA requires 3rd-grade students to be retained if they don’t hit a certain proficiency level, and when you retain your lowest-performing students in 3rd grade, your 4th grade proficiency scores will obviously improve.

I don’t know enough about what happened (or continues to happen) in Mississippi to rigorously critique their proficiency scores or to say that their improvements are real or manipulated. It’s probably a bit of both. My bigger point is that it’s very difficult to attribute changes to any single factor, particularly in real-world, non-experimental settings where there are a gorillion different variables we need to account for. COVID may have been the biggest thing happening in the world in 2019-2021, but it wasn’t the only thing, and when we're making comparisons across states, it's difficult to account for all of the differences that could influence student learning.

What’s more, people are generally pretty bad at accurately attributing causality. We have all sorts of biases (e.g. self-serving bias, hindsight bias) and limitations that lead us to retrospectively tell convenient stories that may not truly represent the causal mechanisms. But they sound plausible. So there’s a psychological problem at play here, too, if someone from Massachusetts calls up someone from Mississippi and says “explain what’s working for you.” This isn’t to say the person from Massachusetts shouldn’t make this call, but rather that they shouldn’t take the Mississippian’s answer as, like, gospel.

That is, even if the effects in Mississippi are real, it's not obvious to me that someone in the Mississippi Department of Education could accurately convey the whole causal apparatus. 

The second point Oster makes is that we ought to continue to collect and analyze “data like this” (i.e. state testing data), since it can help inform us about what policies are working. It's hard to argue with this... and yet, here we are, hamstringing our ability to collect these data. The logic (if there is any actual logic) behind gutting the US Department of Education has never been clear to me. It seems like, regardless of our political leanings, we would want data to help determine what’s working and what isn’t? But maybe facts don't matter in 2025.
