Information Flow Routes: Automatically Interpreting Language Models at Scale

Javier Ferrando, Elena Voita

TL;DR

This work reframes Transformer computations as information flowing along routes between token representations, and introduces an attribution-based, top-down method to automatically extract the important information flow subgraphs for any given prediction. By using ALTI-inspired edge attributions rather than activation patching, the approach is significantly faster (roughly 100x) and can surface both general and contrastive task-specific circuits without human-designed templates. Empirical results on Llama 2 show that certain heads, such as previous-token and subword-merging heads, are broadly important, while others specialize to domains like coding or multilingual text; the method also reveals POS- and subword-related patterns and bottom-up as well as top-down information flow dynamics. The approach thus provides a scalable, versatile framework for interpreting LM behavior, enabling more systematic comparisons across predictions, templates, and domains with practical implications for model debugging, safety, and alignment.

Abstract

Information flows along routes inside the network via mechanisms implemented in the model. These routes can be represented as graphs where nodes correspond to token representations and edges to operations inside the network. We automatically build these graphs in a top-down manner, keeping only the most important nodes and edges for each prediction. In contrast to existing workflows that rely on activation patching, we do this through attribution, which allows us to efficiently uncover existing circuits with just a single forward pass. Additionally, our method applies far more broadly than patching: we do not need a human to carefully design prediction templates, and we can extract information flow routes for any prediction (not just the ones among the allowed templates). As a result, we can talk about model behavior in general, for specific types of predictions, or for different domains. We experiment with Llama 2 and show that some attention heads are important overall, e.g. previous-token heads and subword-merging heads. Next, we find similarities in Llama 2 behavior when handling tokens of the same part of speech. Finally, we show that some model components can be specialized for domains such as coding or multilingual texts.
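The top-down construction described in the abstract can be illustrated with a minimal sketch. It assumes edge importances have already been computed by some attribution method and are given as plain numbers; the function name `extract_important_subgraph`, the `incoming` dictionary layout, the `(layer, position)` node labels, and the toy importance values are all hypothetical and are not taken from the paper. Starting from the node that produces the prediction, the sketch walks downward and keeps only edges whose importance is at least a threshold `tau`.

```python
from collections import deque


def extract_important_subgraph(incoming, start, tau):
    """Keep only edges with importance >= tau, walking top-down from `start`.

    incoming: dict mapping a node to a list of (source_node, importance)
              pairs for the edges that feed into it (assumed precomputed).
    start:    the node at which the prediction is made.
    tau:      importance threshold; edges below it are dropped.
    """
    kept_nodes = {start}
    kept_edges = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for source, importance in incoming.get(node, []):
            if importance < tau:
                continue  # unimportant edge: do not follow it further down
            kept_edges.append((source, node, importance))
            if source not in kept_nodes:
                kept_nodes.add(source)
                queue.append(source)
    return kept_nodes, kept_edges


# Toy graph with made-up importances; nodes are (layer, position) pairs.
incoming = {
    (2, 2): [((1, 2), 0.6), ((1, 0), 0.3), ((1, 1), 0.02)],
    (1, 2): [((0, 2), 0.9), ((0, 1), 0.01)],
    (1, 0): [((0, 0), 1.0)],
    (1, 1): [((0, 1), 0.7), ((0, 0), 0.3)],
}

nodes, edges = extract_important_subgraph(incoming, start=(2, 2), tau=0.04)
print(sorted(nodes))
print(len(edges))
```

In this toy graph, node `(1, 1)` is never visited because its only edge into the kept part of the graph falls below the threshold, so everything reachable only through it is pruned without being examined. Raising `tau` yields a sparser subgraph; lowering it keeps more routes.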

Paper Structure (47 sections, 11 equations, 15 figures)

Figures (15)

  • Figure 1: The important information flow routes for a token (Mary) prediction. GPT2-Small, $\tau = 0.04$.
  • Figure 2: Full information flow graph.
  • Figure 3: General-case algorithm for extracting the important subgraph of an information flow graph.
  • Figure 4: Decomposition of an update coming from an attention head into per-input terms. Layer indices are omitted for readability.
  • Figure 5: Decomposition of an update coming from an entire attention layer into per-input terms. Layer indices are omitted for readability.
  • ...and 10 more figures