The best guide to spotting AI writing comes from Wikipedia


We’ve all felt the creeping suspicion that something we’re reading was written by a large language model, but it’s remarkably difficult to pin down. For a few months last year, everyone became convinced that specific words like “delve” or “underscore” could give models away, but the evidence is thin, and as models have grown more sophisticated, telltale words have become harder to track.

But as it turns out, the folks at Wikipedia have gotten pretty good at flagging AI-written prose, and the group’s public guide to “Signs of AI writing” is the best resource I’ve found for determining whether your suspicions are justified. (Credit to poet Jameson Fitzpatrick, who pointed out the document on X.)

Since 2023, Wikipedia editors have been working to get a handle on AI submissions, a project they call Project AI Cleanup. With millions of edits coming in every day, there’s plenty of material to draw on, and in classic Wikipedia editor style, the group has produced a detailed, evidence-packed field guide.

First of all, the guide confirms what we already knew: automated detection tools are basically useless. Instead, the guide focuses on habits and turns of phrase that are rare on Wikipedia but common on the internet at large (and, therefore, common in model training data). According to the guide, AI submissions spend a lot of time emphasizing why a subject is important, usually in generic terms like “pivotal moment” or “broader movement.” AI models also spend a lot of time detailing minor points of information to make the subject seem notable, the kind of thing you’d expect from a personal bio, but not from an independent source.

The guide also flags a particularly interesting quirk: clauses tacked onto the end of a sentence that make vague claims of significance. Models will say that some event or detail is “underscoring the importance” of one thing or another, or “reflecting the continuing importance” of some general idea. (Grammar nerds will know this as the “present participle.”) It’s a little hard to define, but once you can recognize it, you’ll see it everywhere.

There is also a tendency towards using vague marketing language, which is very common on the Internet. The landscape is always stunning, the views are always breathtaking, and everything is clean and modern. As the editors put it, “It looks like the script for a TV commercial.”

The guide is worth reading in its entirety, and I came away very impressed. Before reading it, I would have said that LLM prose was evolving too quickly to pin down. But the habits flagged here are deeply rooted in the way AI models are trained and deployed. They can be disguised, but it will be difficult to get rid of them completely. And if the general public becomes savvier at identifying AI prose, that could have all sorts of interesting consequences.
