How to Read a Detection Report: Making Sense of Your Score

April 26, 2026 · 8 min read · By Colin

TL;DR

Most people who run text through an AI detector look at the number and form an opinion based on whether it's above or below some threshold they have in their head. Above 70%: probably AI. Below 30%: probably human. 50%: unclear, move on.

That's a reasonable instinct, and it's wrong in a useful way: the aggregate score compresses much more specific information, and acting on the aggregate alone means missing most of what the detection found.

I built Content Trace partly because I wanted a detection tool that showed its work, one that gave you the category breakdown rather than just the number. But even with a breakdown available, I see people default to the aggregate. This post explains what you're looking at when you run a detection, and how to put the results to practical use.

8 weighted categories in Content Trace's analysis

Each category tells a different story. Cognitive fingerprinting (16%) and structural patterns (12%) are the hardest signals to fake.

The two families of signal — and why they diverge

Detection signals break into two broad families: statistical and behavioral. Understanding what each measures is essential to reading what a score is telling you.

Statistical signals measure properties of the text itself — word predictability (perplexity), sentence length variation (burstiness), vocabulary range, and similar surface features. These signals emerge from how language models generate text mechanically: by choosing statistically likely continuations at each step, they produce output that's measurably more uniform than human writing in certain ways.
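To make two of these surface features concrete, here is a minimal Python sketch of burstiness and vocabulary range. It is an illustration under simple assumptions, not how Content Trace computes anything: the function names are hypothetical, burstiness is taken as the coefficient of variation of sentence length, vocabulary range as a type-token ratio, and sentences are split naively on end punctuation. Perplexity is left out because it needs a language model to score each word.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Word counts per sentence, using a naive split on ., ! or ?"""
    # Assumption: end punctuation followed by whitespace ends a sentence.
    # This mis-splits abbreviations such as "e.g." and is only a sketch.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence length (std dev / mean).

    Values near 0 mean very uniform sentence lengths.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words, as a rough vocabulary range."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


uniform = "The cat sat down. The dog ran off. The bird flew up."
varied = "No. The dog ran all the way across the wide field before stopping. Why?"
print(burstiness(uniform), burstiness(varied))
```

The uniform sample has three four-word sentences, so its burstiness is exactly zero, while the varied sample scores well above it. Raw values like these are not detection scores; a real detector would have to calibrate them against reference text.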

Behavioral signals measure the presence or absence of things that humans do when they write — opinion drift across a piece, self-corrections and caveats, specific authentic detail that doesn't fully serve the argument, the kind of structural discovery that happens when someone is actually working something out rather than delivering a pre-formed answer.

The critical insight is that these two families respond differently to editing and post-processing. Statistical signals change when text is paraphrased or restructured. Behavioral signals don't — they're properties of the underlying thinking, not the surface text. When you see a big gap between these two families in a detection result, it's telling you something specific about what happened to the content.

Reading the patterns — what different score profiles mean

High behavioral, low statistical

This is the most common pattern for content that was AI-generated and then processed through a paraphrasing tool. The statistical surface has been disrupted — word choices are more varied, sentence lengths less uniform — but the behavioral markers didn't move because they're not in the text surface. The opinions are still uniformly consistent. There's still no self-correction. The examples are still constructed rather than remembered.

If you're editing content and see this pattern, the paraphrasing is done. That's not your problem anymore. Your problem is everything in the behavioral categories, and the fix is actual editorial work: adding a genuine position, a real sourced claim, a detail from experience. Adding more paraphrasing will do nothing useful.

High statistical, low behavioral

This pattern is less common and more interesting. Statistically the text looks like AI — low perplexity, high uniformity — but behaviorally it looks human. This typically means one of three things: very structured human writing in a formal register (legal documents, academic papers, technical documentation), a human writer who happens to write in a particularly uniform style, or AI output that had genuine human knowledge injected into it without restructuring the surface text.

Structured human writing flags this way because academic and legal register is deliberately uniform — precise vocabulary is a feature, varied sentence length is a bug. If you're running that kind of content through detection, a high statistical score is expected and shouldn't be alarming. The behavioral signals will tell you whether there's genuine thinking in the content, which matters more.

Both high — close to pure AI

When both families score high, the content is very likely largely unedited AI output. The statistical patterns of language model generation are present and the behavioral signals of human cognition are absent. This is the profile of a piece that went from prompt to publish with minimal intervention.

The specific categories worth looking at here: cognitive fingerprinting (authentic specificity, self-correction), opinion and perspective (is there a genuine take anywhere in the piece?), and structural patterns (does the argument develop or just get delivered?). These three categories carry the most weight and are the most actionable.

Both low — genuinely human or heavily edited

Low scores across both families mean either genuinely human writing or content where substantial human editing added both surface variation and cognitive depth. This is the target profile for AI-assisted content — not necessarily zero statistical AI signal, but low enough on both dimensions that the human contribution is clearly dominant.

It's worth noting that both-low doesn't mean "definitely human." A skilled writer who coaxes very human-sounding output from an AI, or AI output that went through extensive genuine editing, can land here too. Detection is probabilistic at every score. The question is always what's more likely given the full profile.
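The four profiles above can be sketched as a small lookup. This is an illustration rather than Content Trace's logic: it assumes each family has been reduced to a single 0–100 score where higher means more AI-like, and the cutoff of 50 between "low" and "high" is an arbitrary assumption, as are the function and variable names.

```python
# Keys are (statistical is high, behavioral is high).
PROFILES = {
    (False, True): "high behavioral, low statistical: likely paraphrased AI; "
                   "needs editorial work, not more paraphrasing",
    (True, False): "high statistical, low behavioral: formal or uniform human "
                   "writing, or AI text with real human knowledge added",
    (True, True): "both high: very likely largely unedited AI output",
    (False, False): "both low: genuinely human or heavily edited, "
                    "though never certain",
}


def classify_profile(statistical: float, behavioral: float,
                     threshold: float = 50.0) -> str:
    """Map two family scores (0-100, higher = more AI-like) to a profile.

    The threshold is a hypothetical cutoff chosen for illustration.
    """
    return PROFILES[(statistical >= threshold, behavioral >= threshold)]


print(classify_profile(statistical=20, behavioral=80))
```

A hard cutoff is the weakest part of this sketch. Scores near the threshold deserve the same probabilistic reading as any other result, so treat the label as a starting point for looking at the categories.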

The middle range — where the interesting stuff lives

Score Interpretation: what different aggregate ranges typically mean

Range      Typical meaning
85–100%    Very likely AI or minimally edited
65–84%     Probably AI with some editing; behavioral signals elevated
35–64%     Mixed; examine category breakdown closely
15–34%     Probably human or heavily edited AI
0–14%      Very likely human writing
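If you want to apply these ranges in a script, a lookup like the following would do it. The band boundaries and wording come from the ranges above; the function name and the handling of fractional scores (each band is treated as starting at its lower bound, so 64.5 counts as mixed) are assumptions of this sketch.

```python
# (lower bound, interpretation), ordered from the highest band down.
BANDS = [
    (85, "Very likely AI or minimally edited"),
    (65, "Probably AI with some editing; behavioral signals elevated"),
    (35, "Mixed; examine category breakdown closely"),
    (15, "Probably human or heavily edited AI"),
    (0, "Very likely human writing"),
]


def interpret(score: float) -> str:
    """Return the typical meaning of an aggregate score from 0 to 100."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower, meaning in BANDS:
        if score >= lower:
            return meaning
    raise AssertionError("unreachable: the last band starts at 0")


print(interpret(50))
```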

Scores in the 35–64% range are often where the most useful information lives. A 90% score is straightforward: the content is probably mostly AI. A 50% score with high behavioral and low statistical signals tells you something much more specific: the text has been edited, but the editing didn't add human thinking. That's a precise diagnosis with a precise fix.

The right question to bring to a detection result

The question "is this AI?" is rarely the useful one. It's a binary framing for a probabilistic measurement, and acting on it tends to produce either false confidence (low score = fine) or false alarm (high score = discard).

The question I'd recommend instead: which categories in this result are high, and what would it take to move them? A high cognitive fingerprinting score means the piece needs authentic specificity: a real example, a sourced detail, something from memory. A high opinion and perspective score (the opinions are too uniform) means it needs a genuine take: a position that could be wrong, one that someone who knows the topic might disagree with. A high structural patterns score means the argument was delivered rather than developed, so add the working-through, the caveat, the discovery.
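As a rough sketch of that workflow, the snippet below turns a category breakdown into an ordered editing list. Everything structural here is an assumption for illustration: the dictionary keys, the 0–100 scale, and the cutoff of 50 are hypothetical, and only the three categories discussed in this post are covered. The suggested fixes are the ones described above.

```python
FIXES = {
    "cognitive_fingerprinting":
        "add authentic specificity: a real example, a sourced detail, "
        "something from memory",
    "opinion_and_perspective":
        "add a genuine take: a position that could be wrong",
    "structural_patterns":
        "develop the argument: add the working-through, the caveat, "
        "the discovery",
}


def editing_plan(category_scores: dict[str, float],
                 threshold: float = 50.0) -> list[str]:
    """List fixes for categories at or above the threshold, highest first.

    Categories without an entry in FIXES are ignored.
    """
    flagged = sorted(
        (name for name, score in category_scores.items()
         if score >= threshold and name in FIXES),
        key=lambda name: category_scores[name],
        reverse=True,
    )
    return [f"{name}: {FIXES[name]}" for name in flagged]


result = {
    "cognitive_fingerprinting": 80,
    "opinion_and_perspective": 30,
    "structural_patterns": 60,
}
for item in editing_plan(result):
    print(item)
```

Sorting by score puts the thinnest part of the human contribution first, which is usually where editing time pays off most.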

Detection used this way is an editing tool. Not a judgment on whether the content is acceptable, but a map of where the human contribution is thin and where more editorial work would improve it. That's the frame that makes detection results actionable — and it's a more honest account of what they actually measure.