How We Detect AI Hallucinations in Real Time

AI hallucinations, where models generate plausible-sounding but factually incorrect content, represent one of the most significant barriers to trustworthy AI deployment. In high-stakes domains like legal practice, even a single hallucinated case citation can have serious consequences. This post explains the technical approach we use at Telluvian to detect hallucinations in real time.

The Core Insight: Models Know When They're Uncertain

Research in mechanistic interpretability has revealed something important: language models contain internal signals that correlate with factual accuracy. Recalling information learned during training and generating a merely plausible completion are different processes, and they leave different traces in the model's internal activations.

Our approach exploits this insight. By analyzing the patterns of activity in a model's hidden layers as it generates each token, we can identify when the model is likely to be hallucinating—often before the hallucinated content even appears in the output.

Probing Classifiers: Reading the Model's Mind

The technical foundation of our approach is the probing classifier. A probing classifier is a simple model (typically a linear classifier or small neural network) that is trained to predict some property of a language model's output from that language model's internal activations.

In our case, we train probing classifiers to distinguish between:

Tokens that are factually accurate based on the model's training data
Tokens that represent confabulated or uncertain content
Tokens in domains where the model has limited training data

These classifiers are trained on labeled datasets where we know ground truth—for example, legal citations that can be verified against actual case databases, or factual claims that can be checked against authoritative sources.
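To make the idea of a probe concrete, here is a minimal sketch of a linear probe trained with plain gradient descent. It is an illustration under stated assumptions, not our production code: the "activations" are synthetic vectors generated with NumPy rather than real hidden states, the task is simplified to two classes (supported vs. hallucinated) instead of the three categories listed above, and the names train_linear_probe and probe_predict_proba, the hidden size of 16, the learning rate, and the epoch count are all chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_linear_probe(activations, labels, lr=0.1, epochs=200, l2=1e-3):
    """Fit a logistic-regression probe with full-batch gradient descent.

    activations: (n_tokens, hidden_dim) array, one row per token
    labels:      (n_tokens,) array, 1 = hallucinated, 0 = supported
    """
    n, d = activations.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(activations @ w + b)
        # Gradient of the mean cross-entropy loss, plus an L2 penalty on w.
        grad_w = activations.T @ (p - labels) / n + l2 * w
        grad_b = np.mean(p - labels)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def probe_predict_proba(activations, w, b):
    """Return the probe's hallucination probability for each row."""
    return sigmoid(activations @ w + b)


# Synthetic stand-in for labeled hidden states (an assumption for this demo):
# the two classes are shifted apart along one random direction.
rng = np.random.default_rng(0)
hidden_dim = 16
direction = rng.normal(size=hidden_dim)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=400)
activations = rng.normal(size=(400, hidden_dim))
activations += 2.0 * np.outer(2.0 * labels - 1.0, direction)

# Train on the first 300 rows, evaluate on the remaining 100.
w, b = train_linear_probe(activations[:300], labels[:300])
held_out_pred = probe_predict_proba(activations[300:], w, b) > 0.5
held_out_acc = np.mean(held_out_pred == labels[300:])
print(f"held-out accuracy: {held_out_acc:.2f}")
```

A linear probe is a deliberate choice in sketches like this: if a simple classifier can separate the classes, the signal is plausibly present in the activations themselves rather than being learned by the probe.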

Token-Level Analysis

Our approach operates at the token level. Rather than making a single confidence judgment about an entire response, we assign a hallucination probability to each token as it is generated.

This granular approach has several advantages:

Precision: We can identify exactly which parts of a response are reliable and which are suspect
Real-time feedback: Users see confidence information as the response is generated, not after
Nuanced assessment: A response can be partially reliable—accurate in some sections and uncertain in others
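The token-level scoring described above can be sketched as a generator that scores each token as soon as it arrives. This is a hedged illustration only: the stream of (token, hidden_state) pairs is hypothetical, since how activations are captured depends on the model runtime and is not shown here, and the two-dimensional weights and activations are hand-picked toy values rather than anything learned.

```python
from typing import Iterable, Iterator, Tuple

import numpy as np


def stream_token_scores(
    token_stream: Iterable[Tuple[str, np.ndarray]],
    w: np.ndarray,
    b: float,
) -> Iterator[Tuple[str, float]]:
    """Yield (token, hallucination_probability) one token at a time.

    Each hidden state is scored with a linear probe (weights w, bias b)
    as soon as its token arrives, without waiting for the full response.
    """
    for token, hidden in token_stream:
        logit = float(np.dot(hidden, w) + b)
        prob = float(1.0 / (1.0 + np.exp(-logit)))
        yield token, prob


# Toy probe and toy 2-dimensional "activations" (illustrative values only).
w = np.array([1.0, -1.0])
b = 0.0
stream = [
    ("Smith", np.array([-2.0, 1.0])),
    (" v.", np.array([-1.0, 1.0])),
    (" Jones", np.array([3.0, 0.0])),
]

scores = list(stream_token_scores(stream, w, b))
for token, prob in scores:
    print(f"{token!r}: {prob:.3f}")
```

Because the function is a generator, a caller can display each score while later tokens are still being produced, which is what makes per-token feedback possible during generation.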

Legal-Specific Training

General-purpose hallucination detection has value, but legal AI requires domain-specific attention. Legal hallucinations often have specific patterns:

Citation hallucinations: Fabricated case names, incorrect volume/reporter numbers, or wrong years
Holding mischaracterization: Accurate case citations but incorrect descriptions of what the case held
Statutory errors: Misquoted or misinterpreted statutory language
Jurisdictional confusion: Applying law from the wrong jurisdiction

Our classifiers are trained specifically on legal content, with datasets curated to capture these domain-specific failure modes.
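As one illustration of how ground-truth labels for citation hallucinations might be produced, the sketch below extracts citations from text and marks each one as verified or unverified. Everything specific here is an assumption made for the example: the verified set stands in for a lookup against a real case database, "Ex.Rep." is a fictional reporter, the helper name label_citations is hypothetical, and the regular expression handles only a single simplified "volume reporter page (year)" shape, far narrower than real citation formats.

```python
import re

# Simplified pattern: volume, one-word reporter, page, four-digit year.
CITATION_RE = re.compile(r"(\d+) ([A-Z][\w.]*) (\d+) \((\d{4})\)")


def label_citations(text, verified):
    """Return (citation_text, start, end, label) for each citation found.

    label is 0 when the parsed citation is in the verified set and
    1 when it is not (a candidate hallucination for training data).
    """
    labeled = []
    for m in CITATION_RE.finditer(text):
        key = (int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4)))
        label = 0 if key in verified else 1
        labeled.append((m.group(0), m.start(), m.end(), label))
    return labeled


# Fictional reference data: a stand-in for a real case database.
verified = {(12, "Ex.Rep.", 345, 2001)}

# The second citation has the right volume and page but the wrong year.
text = "Compare 12 Ex.Rep. 345 (2001) with 12 Ex.Rep. 345 (2010)."
labeled = label_citations(text, verified)
for citation, start, end, label in labeled:
    print(citation, (start, end), "unverified" if label else "verified")
```

The character offsets matter in practice: they let span-level labels be mapped back onto the individual tokens that a token-level classifier is trained on.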

Integration with Scanf

The technical approach described here powers our Scanf product. When you use Scanf, each token in the model's response is analyzed by our probing classifiers in real time. Tokens with high hallucination probability are visually highlighted, giving you immediate feedback about which parts of the response warrant additional verification.
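The highlighting step can be sketched as a simple threshold over per-token probabilities, with adjacent flagged tokens merged into one highlighted span. This is not Scanf's actual rendering code: the 0.5 threshold, the [[ ]] markers, the function name highlight, and the example citation (which uses a fictional reporter and made-up scores) are all assumptions for illustration.

```python
def highlight(scored_tokens, threshold=0.5, open_mark="[[", close_mark="]]"):
    """Wrap runs of tokens whose probability is >= threshold in markers."""
    parts = []
    in_span = False
    for token, prob in scored_tokens:
        flagged = prob >= threshold
        if flagged and not in_span:
            parts.append(open_mark)   # a flagged run starts here
        elif not flagged and in_span:
            parts.append(close_mark)  # the flagged run ended before this token
        parts.append(token)
        in_span = flagged
    if in_span:
        parts.append(close_mark)      # close a run that reaches the end
    return "".join(parts)


# Made-up tokens and scores: the reporter and page numbers are flagged.
scored = [
    ("See", 0.05), (" Smith", 0.10), (" v.", 0.08), (" Jones", 0.12),
    (",", 0.07), (" 123", 0.81), (" Ex.Rep.", 0.64), (" 456", 0.77),
    (" (2099)", 0.30), (".", 0.02),
]

rendered = highlight(scored)
print(rendered)  # See Smith v. Jones,[[ 123 Ex.Rep. 456]] (2099).
```

Where the threshold sits is a trade-off: a lower value flags more text for review and misses fewer hallucinations, while a higher value produces fewer false alarms.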

Limitations and Future Work

No hallucination detection system is perfect. Our approach has limitations:

Detection accuracy depends on the quality and coverage of our training data
Novel types of hallucinations may not be caught until we update our classifiers
Some hallucinations are inherently difficult to detect from internal activations alone

We're continuously improving our classifiers and expanding our training data to address these limitations. Our goal is not to replace human judgment, but to augment it with reliable signals about AI confidence.

Conclusion

Real-time hallucination detection based on mechanistic interpretability represents a significant advance in AI safety for high-stakes domains. By looking inside AI models rather than just monitoring their outputs, we can provide the kind of reliability information that legal professionals need to use AI confidently and effectively.