This illustration photograph taken on October 30, 2023, shows the logo of ChatGPT, a language model-based chatbot developed by OpenAI, on a smartphone in Mulhouse, eastern France. (Photo by SEBASTIEN BOZON/AFP via Getty Images)
A new analysis in Science puts a number on a question that has worried university faculty since ChatGPT arrived: how many students cheat with generative AI? Drawing on 95,513 students at a representative sample of twenty major public research universities, the authors estimate that about 9% of students who use these tools have turned in AI-generated work they knew might not be allowed. They are careful to note that 9% is lower than many accounts of AI normalizing cheating at scale would suggest.
Two things make the result more interesting than the number itself: how the authors arrived at it, and what happens when you break it down by field, where AI use and cheating run one way across disciplines and the opposite way across students.
How Do You Count Cheaters Who Won’t Admit It?
Any cheating statistic invites the obvious objection that students lie about cheating, and the authors built their estimate to sidestep it.
Rather than ask anyone to confess, they used a list experiment. Students were split at random into two groups. One saw three harmless statements about AI use, such as having explained ChatGPT to a classmate, and reported only how many were true for them. The other saw those three plus a fourth, that they had submitted AI work as their own knowing it might not be allowed, and again reported only the count. Because no one ever marks the sensitive item by itself, the difference in average counts between the groups recovers the share who acknowledge cheating while leaving every answer deniable.
The authors add that the figure may be an undercount, since some students do not realize their own use breaks a rule. Because the sensitive item asks only about knowing violations, the students it misses are those who broke a rule unwittingly.
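The arithmetic behind a list experiment is simple enough to sketch. The Python below is an illustration under stated assumptions, not the authors' code: the function names, the three baseline probabilities, and the simulated group sizes are all hypothetical, and only the 9% rate is borrowed from the article. It simulates two randomly assigned groups and recovers the sensitive share as a difference in mean counts.

```python
import random
from statistics import mean


def simulate_list_experiment(n_per_group, true_rate, baseline_probs, seed=0):
    """Simulate item counts for a control group and a treatment group.

    baseline_probs are hypothetical probabilities that each harmless
    statement is true for a student; true_rate is the assumed share for
    whom the sensitive statement is true.
    """
    rng = random.Random(seed)

    def baseline_count():
        # Number of harmless statements that apply to one simulated student.
        return sum(rng.random() < p for p in baseline_probs)

    # Control group sees only the harmless statements.
    control = [baseline_count() for _ in range(n_per_group)]
    # Treatment group sees the same statements plus the sensitive one.
    treatment = [
        baseline_count() + (rng.random() < true_rate)
        for _ in range(n_per_group)
    ]
    return control, treatment


def estimate_sensitive_share(control, treatment):
    """Difference in mean counts between treatment and control."""
    return mean(treatment) - mean(control)


if __name__ == "__main__":
    control, treatment = simulate_list_experiment(
        n_per_group=50_000,
        true_rate=0.09,                  # figure borrowed from the article
        baseline_probs=(0.5, 0.3, 0.2),  # invented for illustration
    )
    print(f"estimated share: {estimate_sensitive_share(control, treatment):.3f}")
```

Each simulated student reports only a total, so no single response reveals the sensitive answer. The estimate is noisy for the same reason, which is why a design like this needs large samples.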
What Varies and What Holds Steady
Unsurprisingly, student use of generative AI swings enormously by field: 62% of computer science students report using AI regularly, against 24% in the arts. The cheating rate barely moves by comparison. The authors find it somewhat higher in non-STEM fields, where adoption tends to be lower, with economics at 17% and journalism at 16%, and lower in parts of STEM such as biology, at 5%. Across majors, then, heavier adoption goes with slightly less cheating.
But cheating varies far less than use does. Adoption runs from a quarter of students to nearly two-thirds across fields, while the share of users who cheat stays roughly between 5% and 17%. How much a discipline has embraced AI tells you little about how much its students cheat, and economics, high on both counts, shows the two do not always move together.
At the level of the individual student the relationship reverses and sharpens. Among students who use AI daily, the estimated cheating rate is 26%, against 7% for those who use it only monthly. The harder a given student leans on the tools, the more likely that reliance crosses into misconduct.
A weak negative pattern across disciplines and a strong positive one across students is a version of Simpson’s paradox, and the gap is easy to misread. Cheating is estimated only among students who already use AI, so the figure for a low-adoption field like the arts describes a small, self-selected group rather than the field’s whole roster. Aggregating to the major also buries the individual signal, since a field can hold many occasional, legitimate users whose presence holds its rate down.
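A toy calculation shows how the two patterns can coexist. In the sketch below, the two fields, their adoption rates, and their mixes of daily and monthly users are invented for illustration and are not the study's data. Only the 26% and 7% individual rates come from the article, and applying them identically in both fields is an assumption.

```python
# Individual-level rates quoted in the article, assumed here to hold
# identically in every field (an assumption made for illustration).
DAILY_CHEAT_RATE = 0.26
MONTHLY_CHEAT_RATE = 0.07

# Hypothetical fields: adoption is the share of students using AI at all,
# daily_share is the share of those users who use it daily. Invented numbers.
FIELDS = {
    "high_adoption_field": {"adoption": 0.60, "daily_share": 0.10},
    "low_adoption_field": {"adoption": 0.25, "daily_share": 0.50},
}


def user_cheat_rate(daily_share):
    """Cheating rate among a field's AI users, as a mix of heavy and light users."""
    return (
        daily_share * DAILY_CHEAT_RATE
        + (1 - daily_share) * MONTHLY_CHEAT_RATE
    )


if __name__ == "__main__":
    for name, field in FIELDS.items():
        rate = user_cheat_rate(field["daily_share"])
        print(f"{name}: adoption {field['adoption']:.0%}, "
              f"cheating among users {rate:.1%}")
```

Every simulated student behaves the same way in both fields, with daily users cheating more than monthly ones. The field with broader adoption still posts the lower rate, because its users are mostly occasional.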
The Access Concern
The authors raise a second point that deserves scrutiny. They document sizable gaps in who uses AI: 33% of women report regular use against 45% of men, and 29% of underrepresented minority students against 39% of their white and Asian peers. They interpret these gaps as a question of equitable access, suggesting that students from underrepresented backgrounds may have less access to, or familiarity with, the tools.
The access half of that explanation is hard to believe. A general-purpose subscription costs about $20 a month, while tuition in the United States runs into the tens of thousands of dollars, so cost is an unlikely barrier for enrolled students. The gaps also move in ways price cannot explain: they are widest by gender in health sciences and economics, and by race in the arts, humanities and computer science. Familiarity and differing norms about when leaning on AI is appropriate are likelier drivers, and they call for different remedies. The authors are right that the gaps bear on any reform that assumes students can use AI well, but the cause appears to me more cultural than economic.
What’s Worth Grading Now?
If we strip away the framing, a finding emerges that does not depend on either reading. As AI spreads, a polished final product becomes weaker evidence of what a student can do without help, which threatens any assessment that grades the artifact rather than the work behind it. The authors make this case carefully, and they are skeptical of the usual fixes, calling detection a cat-and-mouse game and warning that ostensibly AI-proof exams rarely capture the judgment a degree is meant to certify.
The harder implication is one they leave alone. Many of the capabilities these assessments measure, the routine production of clean prose and working code, are precisely the ones employers are starting to hand to machines. An assessment a model can pass was often testing a skill already losing its market value, which turns the validity problem into a sharper question than detection: what should a degree certify once routine production is automated? Two possibilities are judgment and synthesis, forms of reasoning that do not reduce to a finished document and are correspondingly hard to test.
The Science study is most valuable as measurement, the largest careful estimate we have of how much AI-assisted cheating is happening, and its method is clear about the limits of asking. It was fielded in 2024, so its use figures are best read as a floor. The number everyone will quote is 9%. The number worth sitting with is how much of what we currently grade will still be worth grading once a machine can do it on command.
