Modern large language models are immense black boxes, with parameter counts that can run into the trillions. Yet they still represent internal features in a regular enough way that, by monitoring the appropriate parts, we can begin to understand what a model is thinking at runtime.
Training a black box
The single key discovery of the past few years is the simple realisation that pre-training works. It is possible to take vast amounts of raw, unlabelled data and, by essentially just throwing enough compute at the problem, train a reasonably configured learning algorithm to understand that data on its own. We've seen this play out in language modelling, in image and video generation, and now in the move to world modelling. No human supervision is required; the models learn complex relationships by themselves.
The catch is that the end result of this training is a black box: a model which transforms inputs into desired outputs in its own idiosyncratic and unintuitive way. This makes the model incredibly hard to control and debug. Language models are trained on human text, but they don't act like us. They have their own communication style, approach to problems, and even ethical values.
If we wish to improve these models, control them more effectively, or apply them in critical industries where reliability is non-negotiable, we have to get past this black box nature and understand the model behaviour.
This is the promise of mechanistic interpretability: to break a model down into a set of algorithms that a user can understand and follow. This will allow us to understand how models perform the operations that they do and, possibly more importantly, why they have done them. To understand what a model is thinking about during inference is the ultimate goal of mechanistic interpretability.
Technical Background
Most modern mechanistic interpretability work is done on large pre-trained transformers, the architecture behind models such as ChatGPT or Claude. These models are trained at immense scale on a simple self-supervised objective, such as predicting the next token or removing noise from an image.
Once you strip away the facade, a transformer is little more than a chain of matrix multiplications, interleaved with simple nonlinearities, whose weights are learned to best represent the underlying data. These weights, and the architecture underpinning them, specify the model entirely. Every decision made by a modern large language model, whether it's writing code, poetry or LinkedIn slop, is determined by the model weights.
Transformer layers are connected differently from those in more basic architectures such as plain multilayer perceptrons (MLPs) or convolutional networks (ConvNets), where each layer simply feeds its output to the next. In a transformer, the layers are connected by a kind of information highway, referred to as the 'residual stream'. Each layer "reads" from the residual stream, taking it as input, performs some operation on this data, and then "writes" the result back by adding it to the stream.
Essentially, the residual stream as it leaves layer L-1 is the only input to layer L. From this simple observation, we can infer that whatever high-level ideas a model holds at a given point in a forward pass must be encoded in this residual stream.
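The read-and-write pattern described above can be sketched in a few lines of NumPy. This is a toy illustration, not a real transformer: the `make_layer` helper, the dimensions, and the use of a small ReLU MLP in place of a full attention-plus-MLP block are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 4  # illustrative sizes, not taken from any real model


def make_layer(d_model, d_hidden, rng):
    """Hypothetical stand-in for a transformer block: a small ReLU MLP.

    A real block would also contain attention and normalisation.
    """
    w_in = rng.normal(scale=0.1, size=(d_model, d_hidden))
    w_out = rng.normal(scale=0.1, size=(d_hidden, d_model))

    def layer(stream):
        hidden = np.maximum(stream @ w_in, 0.0)  # "read" from the stream
        return hidden @ w_out                    # the update to "write" back

    return layer


def forward(x, layers):
    """Run the layers, recording the residual stream after each one."""
    stream = x.copy()
    states = [stream.copy()]
    for layer in layers:
        stream = stream + layer(stream)  # writing is plain addition
        states.append(stream.copy())
    return stream, states


layers = [make_layer(d_model, 4 * d_model, rng) for _ in range(n_layers)]
x = rng.normal(size=d_model)
final, states = forward(x, layers)
```

Here `states[i]` is the stream as it enters layer `i`, and it is the only thing that layer sees. These recorded states are what the rest of this post is concerned with inspecting.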
We should keep in mind that mechanistic interpretability is a vast field, with many methods for understanding all different kinds of models. We are specifically focusing here on supervised learning approaches which investigate the residual stream states of language models. This is a major area within interpretability, but you, dear reader, should bear in mind that there is much more out there that we have not discussed in this post.
Linear Representation Hypothesis
Given that all of a language model's thoughts must be compressed into this residual stream, we would expect it to be utilised extremely densely, leaving us with a very poor signal-to-noise ratio. How would we even begin to break down and understand the information moving through this stream?
The saving grace for us is the Linear Representation Hypothesis (LRH). Layers read from the residual stream by multiplying it against their weight matrices, such as the query, key and value projections in an attention layer. Because these reads are linear, we would expect language models, or any large transformers for that matter, to represent features linearly. For further reading on this topic, we recommend this excellent blog post by Liv Gorton: What Would Non-Linear Features Actually Look Like?
To be more precise, 'representing features linearly' means that linear directions in the residual stream correspond to the semantic concepts that the model is processing. In this representation, the magnitude in that direction corresponds to the intensity of that given feature.
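As a minimal sketch of what this means in practice, the intensity of a feature can be read off by projecting a residual stream state onto the unit vector of that feature's direction. The `feature_intensity` helper and the random vectors below are hypothetical; in a real model the direction would have to be found empirically.

```python
import numpy as np


def feature_intensity(state, direction):
    """Scalar projection of a residual stream state onto a feature direction."""
    unit = direction / np.linalg.norm(direction)
    return float(state @ unit)


rng = np.random.default_rng(0)
d_model = 64  # illustrative size

# Hypothetical feature direction; random here purely for illustration.
direction = rng.normal(size=d_model)
unit = direction / np.linalg.norm(direction)

# A background state with its component along the feature removed.
base = rng.normal(size=d_model)
base = base - (base @ unit) * unit

# The same background with the feature written in at two strengths.
weak = base + 0.5 * unit
strong = base + 3.0 * unit

weak_score = feature_intensity(weak, direction)
strong_score = feature_intensity(strong, direction)
```

Under these assumptions the projection recovers the strength with which the feature was added, regardless of what else is present in the state.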
This is reasonably well grounded empirically. All kinds of features, such as model refusal, show a strong linear component in the residual stream. As such, to find out whether a given piece of information is represented in the model's internal states, it is often enough to train a linear classifier (a 'probe') on the residual stream states.
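A linear classifier of this kind can be sketched as plain logistic regression. The example below uses synthetic data in place of real residual stream states: the planted direction `true_dir`, the labels, and the `train_linear_probe` helper are all assumptions for illustration, and a real probe would be trained on activations captured from an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_samples = 32, 400  # illustrative sizes

# Synthetic "residual stream states": Gaussian noise, plus a planted
# feature direction that is added only when the label is 1.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n_samples)
states = rng.normal(size=(n_samples, d_model)) + 4.0 * labels[:, None] * true_dir


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Logistic regression trained with full-batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = probs - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


# Train on the first 300 samples and evaluate on the held-out remainder.
split = 300
w, b = train_linear_probe(states[:split], labels[:split])
preds = (states[split:] @ w + b) > 0
accuracy = float((preds == labels[split:]).mean())

# Cosine similarity between the learned weights and the planted direction.
cosine = float(w @ true_dir / np.linalg.norm(w))
```

If the feature really is linear, the learned weight vector `w` should point roughly along the feature direction, which is what `cosine` measures in this toy setup. Gradient descent is used here only to keep the sketch dependency-free; any standard logistic regression implementation would serve the same purpose.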
Proxy White-Box Interpretability
These techniques are more than just academic. They give us a way to monitor models at runtime, allowing us to remove damaging outputs and keep our agents on track for longer. But they share one hard requirement: access to the weights.
That leaves two options: use only open-weight models you run yourself, or find a way to carry analysis of open-weight models over to closed, black-box ones.
The latter approach, referred to as Proxy White-Box Interpretability, is what we have adopted at Telluvian. The theory is that models trained on similar data end up with similar internal structures. A direction that means refusal, or deception, in an open model we can take apart will have a close analogue in the closed model. So, by measuring the open model, we obtain an implicit measurement of the black box. The match is never exact, but it is close enough to catch what matters: the moment an agent starts to be uncertain. That allows us to apply our probes to a model whose weights we never had.