Characterizing stable regions in the residual stream of LLMs

Jett Janiak; Jacek Karwowski; Chatrik Singh Mangat; Giorgi Giglemiani; Nora Petrova; Stefan Heimersheim

Abstract

We identify stable regions in the residual stream of Transformers, where the model's output remains insensitive to small activation changes, but exhibits high sensitivity at region boundaries. These regions emerge during training and become more defined as training progresses or model size increases. The regions appear to be much larger than previously studied polytopes. Our analysis suggests that these stable regions align with semantic distinctions, where similar prompts cluster within regions, and activations from the same region lead to similar next token predictions. This work provides a promising research direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.

Paper Structure (18 sections, 4 equations, 26 figures, 1 table)

Introduction
Related work
Methods
Experiments
Illustrative examples
Impact of model size
Impact of training progress
Discussion
Acknowledgements
Comparison of OLMo and Qwen2 model families
Similar and dissimilar prompts
Details of the 2D slice visualization
More 2D slice plots
Shapes when patching after 1st vs after 7th layer
Direct comparison with polytopes
...and 3 more sections

Figures (26)

Figure 1: Visualization of stable regions in OLMo-7B during training. Colors represent the similarity of model outputs to those produced by three model-generated activations (red, green, blue circles). Each subplot shows a 2D slice of the residual stream after the first layer at different stages of training, with the number of processed tokens indicated in the titles. As training progresses from left to right, distinct regions of solid color emerge and the boundaries between them sharpen. Refer to the end of the Experiments section for details.
Figure 2: (a,b) Relative output distance as a function of $\alpha$ for (a) similar and (b) dissimilar pairs of prompts in $\texttt{Qwen2-0.5B}$. (c) Normalized logit difference between top prediction for $p_B$ and $p_A$.
Figure 3: (a,b) Median relative output distance as a function of interpolation coefficient $\alpha$ for different models from the (a) $\texttt{OLMo}$ and (b) $\texttt{Qwen2}$ families. (c) Maximum slope as a function of the number of parameters for both model families. Dots represent median, and error bars represent 25th and 75th percentiles.
Figure 4: (a,b) Median relative output distance as a function of $\alpha$ for (a) $\texttt{OLMo-1B}$ and (b) $\texttt{OLMo-7B}$ models. (c) Maximum slope as a function of the number of training tokens for both models. Dots represent median, and error bars represent 25th and 75th percentiles. Note that in subfigures (a) and (b) we only show a few selected checkpoints for readability.
Figure 5:
...and 21 more figures
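
The quantities plotted in Figures 2 to 4 (relative output distance as a function of the interpolation coefficient $\alpha$, and its maximum slope) can be illustrated with a toy sketch. This is not the paper's code: `toy_model` is a hypothetical stand-in for the layers downstream of the residual stream, and the distance definition used here, $\|f(x_\alpha) - f(a)\| / \|f(b) - f(a)\|$ with $x_\alpha = (1-\alpha)a + \alpha b$, is an assumption inferred from the captions.

```python
import numpy as np

def toy_model(x, w, sharpness=20.0):
    """Hypothetical stand-in for the downstream network: a sharpened softmax."""
    logits = sharpness * (w @ x)
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def relative_output_distance(f, a, b, alphas):
    """Assumed definition: ||f(x_alpha) - f(a)|| / ||f(b) - f(a)||."""
    out_a, out_b = f(a), f(b)
    denom = np.linalg.norm(out_b - out_a)
    return np.array(
        [np.linalg.norm(f((1 - t) * a + t * b) - out_a) / denom for t in alphas]
    )

def max_slope(alphas, dists):
    """Largest finite-difference slope of the distance curve."""
    return np.max(np.abs(np.diff(dists) / np.diff(alphas)))

# Two toy "activations" whose outputs differ (illustrative values only).
w = np.eye(3)
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])

alphas = np.linspace(0.0, 1.0, 101)
dists = relative_output_distance(lambda x: toy_model(x, w), a, b, alphas)
slope = max_slope(alphas, dists)

print("distance near a (alpha=0.2):", dists[20])
print("distance near b (alpha=0.8):", dists[80])
print("maximum slope:", slope)
```

In this toy setting the curve stays flat near both endpoints and jumps sharply around the midpoint, which is the qualitative signature the captions describe: low sensitivity inside a region and high sensitivity at its boundary. A linear response would instead give a maximum slope of 1.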
