I Feel Like I'm Taking Crazy Pills

I recently stumbled onto a preprint research article examining the writing quality of college students before and after the introduction of ChatGPT, as well as some coverage of the same study published by The Hechinger Report. And oh boy did reading it make me feel like Will Ferrell in Zoolander.

The authors of the study analyzed 1.1 million discussion forum posts (i.e. mostly brief, informal writing assignments) from ~16,000 college students submitted between Fall 2021 and Winter 2024, and they estimated the writing quality of each post. The crux of the paper was to compare student writing quality before and after the emergence of ChatGPT (and other LLMs). I’m sure you can guess what the findings are, but I’ll describe them anyway.

First, the posts submitted in the during-LLM era were higher-quality than the posts submitted in the pre-LLM era. Second, “linguistically disadvantaged students” (which basically means students with lower entering writing scores) showed larger improvements than their peers. The authors also report a third finding that linguistically disadvantaged students of higher socioeconomic status (SES) seem to benefit more than linguistically disadvantaged students of lower SES, but that’s less relevant to what I want to get to.

Maybe I’m missing something, but I…don’t understand the point of this study. Their findings are basically 1) ChatGPT produces better-quality writing than a lot of college students, and 2) it produces much better writing than college students who are bad writers.

Do we need a study to tell us this? What are we even doing here?

Whether it’s intentional or not, the authors of the study are doing some real sleight-of-hand with the term “writing quality” here. They’re comparing the quality of a lot of (over a million!) pre- and during-LLM era forum posts, and the finding is, not surprisingly, that the quality of the posts in the during-LLM era is, on average, higher. 

Right. 

Notice that they’re not saying that any given student’s writing improved. They’re also not saying that students became more proficient writers. The authors can’t actually make these claims based on their analyses, even though this last one – does using LLMs help students become more proficient writers? – is the question that actually matters. Because that’s a causal question, and answering it requires causal methods. The classical way to test it would be a randomized controlled trial (RCT): randomly select students and split them into treatment and control groups, assess everyone’s baseline writing quality in a standardized environment (without LLMs), let the treatment group use an LLM to assist with their writing over a semester while the control group is precluded from using LLMs, then retest both groups (again with no LLMs!) at the end of the semester and compare the writing quality. You could even look at interactions and subgroup effects if you sampled students carefully enough.
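
To make that concrete, here’s a minimal sketch of how the analysis of such an RCT might look. Everything here is hypothetical – the sample size, the scoring scale, the assumed effect – but the structure is the point: both the baseline and the end-of-semester tests happen without LLM assistance, so a difference in gains between arms can be read as the causal effect of LLM access on writing proficiency.

```python
# Minimal sketch of the hypothetical RCT described above (made-up numbers throughout).
# Students are randomized to treatment (LLM access during the semester) or control,
# and scored on a standardized, no-LLM writing task before and after the semester.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500  # students per arm (hypothetical sample size)

# Simulated baseline writing scores on an arbitrary 0-100 scale.
baseline_treat = rng.normal(70, 10, n)
baseline_ctrl = rng.normal(70, 10, n)

# Simulated end-of-semester scores, again measured WITHOUT LLM assistance.
# Purely for illustration, we assume a small true effect of +2 points from LLM access.
post_treat = baseline_treat + rng.normal(2, 8, n)
post_ctrl = baseline_ctrl + rng.normal(0, 8, n)

# Compare gains (post minus baseline) between arms; randomization is what lets us
# interpret the difference in mean gains as the effect of LLM access on proficiency.
gain_treat = post_treat - baseline_treat
gain_ctrl = post_ctrl - baseline_ctrl
t_stat, p_value = stats.ttest_ind(gain_treat, gain_ctrl)

print(f"Mean gain (treatment): {gain_treat.mean():.2f}")
print(f"Mean gain (control):   {gain_ctrl.mean():.2f}")
print(f"Difference: {gain_treat.mean() - gain_ctrl.mean():.2f} (p = {p_value:.3f})")
```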

You could do something similar with a natural experiment, which is kinda-sorta what the authors are doing here but not actually, where you compare writing quality before and after some event (in this case, the emergence of LLMs). But to make claims about student writing proficiency, you need the writing samples to be produced without the use of LLMs.
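
And here’s the catch with the kinda-sorta natural experiment the authors actually ran: if the during-LLM-era posts are written with LLM assistance, the before/after comparison measures the tool, not the students. A toy illustration, with made-up numbers and an assumed LLM “boost,” shows how observed post quality can rise even when underlying proficiency doesn’t budge:

```python
# Toy illustration: underlying student proficiency is identical in both eras,
# but some post-era forum posts get an LLM assist, so the naive before/after
# comparison shows an "improvement" that says nothing about the students.
import numpy as np

rng = np.random.default_rng(1)
n_posts = 100_000  # hypothetical number of posts per era

proficiency_pre = rng.normal(60, 12, n_posts)   # pre-ChatGPT era
proficiency_post = rng.normal(60, 12, n_posts)  # during-ChatGPT era (unchanged)

# Observed quality: pre-era posts reflect proficiency directly; post-era posts
# are boosted for the (assumed) share of posts written with LLM help.
quality_pre = proficiency_pre
uses_llm = rng.random(n_posts) < 0.4             # assume 40% of posts are LLM-assisted
quality_post = proficiency_post + uses_llm * 15  # assumed quality boost from the LLM

print(f"Mean observed quality, pre-LLM era:     {quality_pre.mean():.1f}")
print(f"Mean observed quality, during-LLM era:  {quality_post.mean():.1f}")
print(f"Mean underlying proficiency, both eras: ~{proficiency_pre.mean():.1f}")
```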


Imagine we're living millennia ago, and part of school for early humans involved picking up heavy rocks. Imagine we have data describing the weight of rocks that thousands of humans could lift. We dutifully record the heaviest rock that each student lifts per week on our papyrus scrolls with our feather quills. Then imagine someone invents the pulley. Now imagine people are allowed to use the pulley to lift their rocks, but we’re still collecting data describing the weight of the rocks they’re lifting. Obviously, people are going to be able to lift heavier rocks. That's the whole point of the pulley – it provides mechanical advantage! But it would be insane to conclude that anyone is stronger.

Or you could imagine a similar scenario with calculators and arithmetic. Have people become, on average, better at arithmetic since the advent of calculators? My guess is no, although we certainly can arrive at solutions faster.

Renzhe Yu, the first author of the study, even says as much to Jill Barshay, author of the Hechinger Report synopsis: “It all comes down to motivation. If they’re not motivated to learn, then students will only make bad use of whatever the technology is.” In other words, they’ll just get ChatGPT to do their writing for them and won’t actually develop as writers.

So why bother with the study in the first place?