How We Detect AI Hallucinations in Real-Time

AI hallucinations—when models generate plausible-sounding but factually incorrect content—represent one of the most significant barriers to trustworthy AI deployment. In high-stakes domains like legal practice, even a single hallucinated case citation can have serious consequences. This post explains the technical approach we use at Telluvian to detect hallucinations in real-time.

The Core Insight: Models Know When They're Uncertain

Research in mechanistic interpretability has revealed something important: language models contain internal signals that correlate with factual accuracy. When a model "recalls" information it learned during training, its internal activations tend to look different from when it "generates" a merely plausible completion.

Our approach exploits this insight. By analyzing the patterns of activity in a model's hidden layers as it generates each token, we can identify when the model is likely to be hallucinating—often before the hallucinated content even appears in the output.

Probing Classifiers: Reading the Model's Mind

The technical foundation of our approach is the probing classifier. A probing classifier is a simple model (typically a linear classifier or small neural network) trained to predict some property of a language model's output from that language model's internal activations.

In our case, we train probing classifiers to distinguish between content that is factually accurate and content that is hallucinated.

These classifiers are trained on labeled datasets where we know ground truth—for example, legal citations that can be verified against actual case databases, or factual claims that can be checked against authoritative sources.
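To make the idea of a probing classifier concrete, here is a minimal sketch of a linear probe trained with plain logistic regression. It is not our production code: the function names (`train_linear_probe`, `probe_score`), the four-dimensional "activations", and the synthetic data are all assumptions made for illustration. In a real setting the inputs would be hidden-layer activations extracted from the language model, and the labels would come from verified ground truth.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(activations, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights with batch gradient descent.

    activations: list of equal-length float vectors (one per example)
    labels:      1 = hallucinated, 0 = verified accurate
    """
    dim = len(activations[0])
    n = len(activations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, x) -> float:
    """Return the probe's estimated probability that x is hallucinated."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Synthetic stand-in for real activations (assumption): the first
# dimension carries the signal, the remaining three are noise.
rng = random.Random(0)


def fake_activation(label: int):
    signal = rng.gauss(2.0 if label else -2.0, 1.0)
    return [signal] + [rng.gauss(0.0, 1.0) for _ in range(3)]


labels = [i % 2 for i in range(200)]
activations = [fake_activation(y) for y in labels]

w, b = train_linear_probe(activations, labels)
accuracy = sum(
    (probe_score(w, b, x) >= 0.5) == bool(y)
    for x, y in zip(activations, labels)
) / len(labels)
```

A linear probe is deliberately simple: if a classifier this small can separate the two classes, the signal is plausibly present in the activations themselves rather than learned by the probe.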

Token-Level Analysis

One of the key features of our approach is that it operates at the token level. Rather than making a single confidence judgment about an entire response, we assign hallucination probabilities to each token as it's generated.
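As a rough illustration of token-level scoring, the sketch below applies an already-trained linear probe to each token's activation vector as tokens arrive. The names `TokenScore` and `score_stream`, the two-dimensional activations, and the probe weights are hypothetical placeholders rather than our actual interface.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator, Sequence, Tuple


@dataclass
class TokenScore:
    token: str
    hallucination_prob: float


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score_stream(
    token_activations: Iterable[Tuple[str, Sequence[float]]],
    weights: Sequence[float],
    bias: float,
) -> Iterator[TokenScore]:
    """Yield one score per token, in generation order.

    Because this is a generator, each token is scored as soon as its
    activation vector is available, without waiting for the full response.
    """
    for token, activation in token_activations:
        logit = sum(w * a for w, a in zip(weights, activation)) + bias
        yield TokenScore(token, sigmoid(logit))


# Made-up probe parameters and activations, for illustration only.
weights = [1.5, -0.5]
bias = 0.0
stream = [
    ("Smith", [-2.0, 0.0]),
    (" v.", [-1.0, 1.0]),
    (" Jones", [2.0, 0.0]),
]

scores = list(score_stream(stream, weights, bias))
```

With these placeholder numbers, the first two tokens receive low probabilities and the last one a high probability, so a downstream consumer could treat each token differently instead of judging the response as a whole.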

This granular approach has several advantages:

Legal-Specific Training

General-purpose hallucination detection has value, but legal AI requires domain-specific attention. Legal hallucinations often have specific patterns:

Our classifiers are trained specifically on legal content, with datasets curated to capture these domain-specific failure modes.

Integration with Scanf

The technical approach described here powers our Scanf product. When you use Scanf, each token in the model's response is analyzed by our probing classifiers in real-time. Tokens with high hallucination probability are visually highlighted, giving you immediate feedback about which parts of the response warrant additional verification.
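The highlighting step can be sketched as a simple threshold over per-token probabilities. This is an illustrative approximation, not Scanf's rendering code: the `highlight` function, the 0.5 threshold, the `[[` and `]]` markers, and the example tokens and probabilities are all assumptions.

```python
from typing import Sequence


def highlight(
    tokens: Sequence[str],
    probs: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not a tuned value
    open_mark: str = "[[",
    close_mark: str = "]]",
) -> str:
    """Wrap runs of tokens whose probability meets the threshold in markers.

    Adjacent flagged tokens are merged into a single highlighted span.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")

    out = []
    in_span = False
    for token, p in zip(tokens, probs):
        flagged = p >= threshold
        if flagged and not in_span:
            out.append(open_mark)
        elif not flagged and in_span:
            out.append(close_mark)
        out.append(token)
        in_span = flagged
    if in_span:
        out.append(close_mark)  # close a span that runs to the end
    return "".join(out)


# Placeholder tokens and probabilities, invented for this example.
tokens = ["The ", "court ", "cited ", "Doe ", "v. ", "Roe", "."]
probs = [0.03, 0.05, 0.08, 0.91, 0.87, 0.94, 0.04]

rendered = highlight(tokens, probs)
print(rendered)  # The court cited [[Doe v. Roe]].
```

Merging adjacent flagged tokens keeps the display readable: a reviewer sees one highlighted citation to verify rather than three separately marked fragments.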

Limitations and Future Work

No hallucination detection system is perfect. Our approach has limitations:

We're continuously improving our classifiers and expanding our training data to address these limitations. Our goal is not to replace human judgment, but to augment it with reliable signals about AI confidence.

Conclusion

Real-time hallucination detection based on mechanistic interpretability represents a significant advance in AI safety for high-stakes domains. By looking inside AI models rather than just monitoring their outputs, we can provide the kind of reliability information that legal professionals need to use AI confidently and effectively.