Why Your AI Detector Score Keeps Changing (And What to Do About It)
- Different detectors use different signal mixes and different reference models — disagreement between tools is expected, not a sign something's broken.
- Score variation within a single tool across runs usually reflects stochastic model components, segmentation differences, or silent model updates.
- Short texts (under 300 words) produce much less reliable scores. The signal-to-noise ratio drops fast as length decreases.
- The most reliable way to read scores: look at category breakdowns across tools, not a single aggregate from one tool.
- Score instability across tools is itself information — it signals genuine ambiguity in the text, which is worth investigating rather than resolving with a verdict.
Someone I know spent twenty minutes last week running the same blog post through four AI detectors. GPTZero: 12%. Winston AI: 67%. Copyleaks: 88%. Originality.ai: 34%. Same text. Four tools. She came away convinced the whole category was useless.
I don't think that's the right conclusion — but I understand why it's the one she reached. The variation is real, it's often large, and no one explains it clearly. The answer to why detector scores diverge this much is actually interesting, and understanding it changes how you use these tools.
Below 300 words, signal-to-noise drops significantly. Short texts produce wider variance across tools and across runs on the same tool.
Why different tools give different scores
Different detectors are measuring different things, and that's honestly most of the story. A tool that leans primarily on perplexity will give you a different answer than a tool weighting behavioral signals, because the same text can score very differently on those two dimensions: it can look human on one and AI-generated on the other. The tools aren't disagreeing about a fact; they're measuring different properties of the text.
Beyond signal choice, there's the reference model problem. Perplexity is always calculated against a specific language model — how surprising are these word choices given what GPT-2, GPT-4, or some other model would predict? Different tools use different reference models. Text that reads as "expected" against one model may look "surprising" against another. If the text was generated by a model that wasn't used as a reference — Claude 3.5 on a detector trained primarily on GPT-3.5 output, for instance — perplexity-based detection becomes less reliable in ways that are hard to predict from the outside.
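To make the reference-model point concrete, here's a minimal sketch of scoring the same text against two different reference models using the Hugging Face transformers library. The model names are just examples, and no production detector is this simple; the point is only that the number changes the moment the reference model does.

```python
# Minimal sketch: the same text, perplexity measured against two reference models.
# Assumes the Hugging Face `transformers` library; model names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model_name: str) -> float:
    """Perplexity of `text` under a causal LM used as the reference model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # The model's loss is the mean negative log-likelihood per token;
        # exponentiating it gives perplexity.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

text = "The quarterly results exceeded expectations across every segment."
print(perplexity(text, "gpt2"))         # one reference model
print(perplexity(text, "gpt2-medium"))  # another; the number will differ
```

Two detectors built on different reference models are, in effect, asking two different models how surprising the text is — so identical text yields different "surprise" scores before any other design choice even comes into play.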
And then there's training data composition. Every detector is trained on labeled datasets of human and AI text. The time period those datasets cover, which AI models contributed the AI examples, which domains and genres the human examples came from — all of this shapes what the detector has learned to recognize. Two tools trained on meaningfully different datasets will make different predictions on edge cases. And most interesting text is, in some sense, an edge case.
Why scores vary across runs of the same tool
This one surprises people more than the cross-tool variation. Run the same text through the same detector a few times and you get 71%, then 68%, then 74%. You'd expect detection to be deterministic (same input, same output), so why does the score move?
A few reasons. Some tools use language models with stochastic inference components, meaning the same input can produce slightly different outputs across calls. Some tools analyze text by segment and aggregate the results — and short or medium-length texts don't always get segmented identically, which affects the aggregate. Some tools update their classifiers without announcing it (this is frustratingly common), so the tool you used last Tuesday might be running a slightly different model than it was last Monday.
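Here's a toy illustration of the segmentation point, assuming a made-up `score_segment` function standing in for whatever per-segment classifier a real tool runs. The scoring logic is deliberately fake; what matters is that chunking the same text differently changes the aggregate.

```python
# Toy illustration: the same text, chunked two ways, gives two aggregates.
# `score_segment` is a hypothetical stand-in for a per-segment classifier.
def score_segment(segment: str) -> float:
    # Fake "AI-likelihood" purely for illustration: longer average words
    # score higher. Real detectors use model-based signals, not this.
    words = segment.split()
    avg_len = sum(len(w) for w in words) / max(len(words), 1)
    return min(avg_len / 10.0, 1.0)

def aggregate(text: str, words_per_segment: int) -> float:
    words = text.split()
    segments = [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]
    return sum(score_segment(s) for s in segments) / len(segments)

text = (
    "The launch went better than expected. Support tickets stayed flat, "
    "and the rollout finished two days early. We still owe the team a "
    "retrospective on the migration tooling, which caused most of the delays."
)
print(round(aggregate(text, 12), 3))  # one segmentation
print(round(aggregate(text, 20), 3))  # another; the aggregate shifts slightly
```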
Small variation across runs of the same tool — a few percentage points — is normal and expected. Large swings (more than 10 points on the same text) suggest either a stochastic model or a tool that isn't stable. That instability is itself a signal about tool quality, and it's worth noting before you make any consequential decisions based on a score.
The text length problem — and why it matters more than people realize
Every AI detector becomes less reliable as text length drops. This is fundamental — you need enough data to establish a stable statistical estimate. Below about 300 words, scores become substantially less trustworthy. Below 150 words, they're nearly meaningless on most tools.
The reason: most signals need multiple instances to establish a pattern. Perplexity, sentence length variation, vocabulary diversity, behavioral pattern frequency — each of these requires the text to do enough things for the detector to see whether those things follow a systematic pattern or appear randomly. A single sentence can be surprisingly constructed by either a human or an AI. Several paragraphs give you enough instances to start reading the pattern.
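A quick simulation makes the length effect visible. This isn't any detector's code, just standard statistics: the spread of a mean per-sentence signal shrinks roughly as one over the square root of the number of sentences.

```python
# Quick simulation: how much a mean per-sentence signal wobbles at
# different text lengths. Pure statistics, not any detector's code.
import random
import statistics

random.seed(0)

def observed_spread(n_sentences: int, trials: int = 2000) -> float:
    """Std. dev. of the mean signal across many simulated texts of a given length."""
    means = []
    for _ in range(trials):
        # Each sentence contributes one noisy signal reading.
        readings = [random.gauss(0.5, 0.2) for _ in range(n_sentences)]
        means.append(sum(readings) / n_sentences)
    return statistics.pstdev(means)

for n in (5, 20, 80):   # roughly a tweet, a short post, a full article
    print(n, round(observed_spread(n), 3))
# The spread shrinks roughly as 1/sqrt(n): fewer sentences, noisier score.
```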
Here's the frustrating practical implication: a lot of the content people most want to check — social posts, email copy, short-form ad copy — is exactly the length range where detection is least reliable. A 280-character tweet scoring 85% AI is nearly meaningless. A 1,500-word article scoring 85% AI, with consistent signals across multiple categories, is meaningfully informative.
Score instability is information, not failure
Here's what took me a while to internalize: when a piece of text scores very differently across multiple detectors, that divergence is telling you something about the text — not just about the tools.
Text that's clearly AI-generated tends to score consistently high across well-built detectors, because multiple independent signal types are all pointing the same direction. Text that's clearly human tends to score consistently low for the same reason. Text that produces wildly divergent results is sitting in genuinely ambiguous territory — heavily edited AI content where real human signal was added but not uniformly; structured human writing that overlaps with AI patterns in certain categories; mixed-source content where sections have different origins.
The divergence between tools isn't failing to find the truth. It's accurately representing uncertainty. And the appropriate response to genuine uncertainty is closer examination — not picking the highest score and treating it as the verdict, and not picking the lowest and treating that as a clearance.
"Score instability across tools is a feature — it's telling you the text is genuinely ambiguous."The interesting question is why it's ambiguous, not which tool to trust.
How to actually get more consistent results
Submit longer text. If you're checking something under 400 words, try combining it with adjacent content to give the detector more to work with. More text means more stable estimates.
Look at category breakdowns, not aggregate scores. A single number hides which signals are driving the result. Category-level scores tell you what's actually going on — and they're more stable than aggregates, because individual categories are less affected by single outlier sentences.
Run multiple tools and look for convergence, not an average. Consistent high scores across tools with different signal mixes are meaningful. One high score out of four isn't. You're looking for where the independent analyses agree (a minimal sketch of this reading follows after these tips).
Treat the 30–70% band as a flag, not a verdict. Scores in that range mean the text has features pointing in both directions. That's a reason to examine the specific categories that are high — not a reason to conclude either way.
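Here's the small sketch referenced above: one way to read a set of scores from different tools by checking for convergence rather than averaging. The tool names, scores, and thresholds are illustrative assumptions, not anyone's published defaults.

```python
# One way to read scores from several tools: convergence, not averages.
# Tool names, scores, and thresholds here are illustrative assumptions.
def read_scores(scores: dict[str, float],
                high: float = 70.0, low: float = 30.0) -> str:
    values = scores.values()
    if all(v >= high for v in values):
        return "converges high: strong, consistent AI signal"
    if all(v <= low for v in values):
        return "converges low: consistent human signal"
    return ("diverges: genuinely ambiguous -- inspect category-level "
            "breakdowns rather than averaging to a verdict")

print(read_scores({"tool_a": 12, "tool_b": 67, "tool_c": 88, "tool_d": 34}))
```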
To see what category-level analysis actually looks like in practice, run your text through Content Trace. The 32-signal breakdown shows you exactly which patterns are driving the result, across 8 independently weighted categories. And the full explainer on how detection works covers the underlying mechanics if you want to go deeper.
Frequently asked questions
Is one AI detector more accurate than the others?
Different tools are more accurate in different contexts — on content from specific AI models, in specific domains, at specific text lengths. There's no single most-accurate tool across all scenarios. Tools weighting behavioral signals tend to be more robust to bypass attempts. Tools relying primarily on perplexity tend to be more vulnerable to paraphrasing.
Why does the same text score differently when I add a paragraph?
Because detection analyzes patterns across the full text. Adding a paragraph shifts the overall distribution of signals — the ratio of high-perplexity to low-perplexity sentences, how often behavioral patterns occur, overall vocabulary diversity. If the new paragraph has different characteristics than the existing text, it moves the aggregate score. This is expected behavior, not a bug.
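A tiny arithmetic sketch, with made-up per-paragraph scores, shows why the aggregate moves when you add a paragraph with different characteristics.

```python
# Tiny arithmetic illustration of why adding a paragraph moves the score.
# The per-paragraph scores are invented; the averaging is the point.
existing = [0.82, 0.78, 0.85]           # hypothetical per-paragraph scores
print(sum(existing) / len(existing))     # ~0.82 aggregate

with_new = existing + [0.35]             # new paragraph with different character
print(sum(with_new) / len(with_new))     # ~0.70: the aggregate shifted
```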
Should I trust a tool that claims perfect consistency?
Be skeptical. Some variation across runs is normal and expected, especially for tools with stochastic inference components. A tool claiming perfect consistency may be using a purely deterministic classifier that doesn't account for the probabilistic nature of the underlying signals — which is a different kind of problem. Confidence intervals are more honest than single-point estimates.
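As a sketch of what an interval looks like in practice, here's a simple bootstrap over hypothetical per-segment scores. The numbers are invented; the point is that an interval communicates the uncertainty a single percentage hides.

```python
# Sketch: reporting an interval instead of a single point estimate,
# via a simple bootstrap over hypothetical per-segment scores.
import random
import statistics

random.seed(1)
segment_scores = [0.64, 0.71, 0.58, 0.80, 0.66, 0.73, 0.61, 0.69]

boot_means = []
for _ in range(5000):
    sample = random.choices(segment_scores, k=len(segment_scores))
    boot_means.append(statistics.mean(sample))

boot_means.sort()
low, high = boot_means[int(0.025 * 5000)], boot_means[int(0.975 * 5000)]
print(f"point estimate: {statistics.mean(segment_scores):.2f}")
print(f"95% interval:   {low:.2f} to {high:.2f}")  # more honest than one number
```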
What does a score in the 40–60% range actually mean?
It means the text is genuinely ambiguous — features pointing both directions, and no detector can reliably distinguish edited AI content from structured human writing in that range. Rather than trying to resolve it with a verdict, look at the specific categories that are high. That's where the useful information is. The breakdown of behavioral signals covers what to look at specifically.
Does detector accuracy degrade as AI models improve?
For statistical and perplexity-based detection, yes — newer models produce text that's harder to distinguish statistically from human writing. For behavioral signal detection, less so, because the signals being measured (opinion drift, self-correction, authentic specificity) reflect genuine cognitive processes that current AI models still don't reproduce reliably. This is why the shift toward behavioral analysis matters for long-term detection robustness.