Activation Transport Operators

Andrzej Szablewski, Marek Masiak

TL;DR

Activation Transport Operators (ATOs) are explicit, regularised linear maps that predict downstream residuals from upstream residuals. They test for local linear transport in decoder-only transformers, with predictions evaluated in Sparse Autoencoder (SAE) feature space. The authors introduce transport efficiency, derive an upper bound on predictive variance, $R^2_{\text{ceiling}}$, from canonical correlations, and link this efficiency to the size of the Linear Transport Subspace (LTS). Empirically, on a Gemma 2 2B model with SAEs, they show that most linear transport occurs over short distances and in earlier layers, and that transport diminishes with larger leaps and greater depth. ATOs cause only small increases in perplexity, enabling targeted diagnostics and potential low-cost edits. The results provide a compute-light probe of linear channels in LLMs and suggest avenues for attention-mediated routing and feature-targeted interventions.

Abstract

The residual stream mediates communication between transformer decoder layers via linear reads and writes of non-linear computations. While sparse-dictionary learning-based methods locate features in the residual stream, and activation patching methods discover circuits within the model, the mechanism by which features flow through the residual stream remains understudied. Understanding this dynamic can better inform jailbreaking protections and enable early detection and correction of model mistakes. In this work, we propose Activation Transport Operators (ATOs), linear maps from upstream residuals to downstream residuals $k$ layers later, evaluated in feature space using downstream SAE decoder projections. We empirically demonstrate that these operators can determine whether a feature has been linearly transported from a previous layer or synthesised by non-linear layer computation. We develop the notion of transport efficiency, for which we provide an upper bound, and use it to estimate the size of the residual stream subspace that corresponds to linear transport. We report both the transport efficiency and the size of this subspace. This compute-light (no finetuning, <50 GPU-h) method offers practical tools for safety, debugging, and a clearer picture of where computation in LLMs behaves linearly.
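
As a rough illustration of the idea in the abstract, the sketch below fits a ridge-regularised linear map from upstream to downstream residuals and scores it per feature by projecting true and predicted residuals onto decoder directions. It is not the authors' implementation: the function names `fit_ato` and `per_feature_r2`, the affine bias term, the ridge strength, the synthetic data, and the two toy decoder directions are all assumptions made for the example.

```python
import numpy as np


def fit_ato(X_up, Y_down, ridge=1e-2):
    """Fit a ridge-regularised affine map Y_down ~ X_up @ W + b (hypothetical helper)."""
    mu_x, mu_y = X_up.mean(axis=0), Y_down.mean(axis=0)
    Xc, Yc = X_up - mu_x, Y_down - mu_y
    d = Xc.shape[1]
    # Closed-form ridge solution: W = (Xc^T Xc + ridge * I)^{-1} Xc^T Yc
    W = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(d), Xc.T @ Yc)
    b = mu_y - mu_x @ W
    return W, b


def per_feature_r2(Y_true, Y_pred, decoder):
    """R^2 of each feature's projection; decoder has shape (n_features, d_model)."""
    z_true = Y_true @ decoder.T
    z_pred = Y_pred @ decoder.T
    ss_res = ((z_true - z_pred) ** 2).sum(axis=0)
    ss_tot = ((z_true - z_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins for residual activations (assumption: no real model is used).
rng = np.random.default_rng(0)
n, d = 4000, 8
X = rng.standard_normal((n, d))                      # "upstream" residuals
A = rng.standard_normal((d, d))                      # ground-truth linear transport
Y = X @ A + 0.05 * rng.standard_normal((n, d))       # "downstream" residuals
Y[:, 1] = rng.standard_normal(n)                     # direction 1 is unrelated to X

# Toy decoder: feature 0 reads a transported direction, feature 1 the unrelated one.
decoder = np.eye(d)[:2]

# Fit on the first 3000 samples, evaluate on the held-out 1000.
W, b = fit_ato(X[:3000], Y[:3000])
Y_pred = X[3000:] @ W + b
r2 = per_feature_r2(Y[3000:], Y_pred, decoder)
print(r2)
```

In this toy setup the first feature should score close to 1 (its direction is linearly predictable from upstream) and the second close to 0 (its direction carries nothing the upstream residual explains), which mirrors the transported-versus-synthesised distinction the abstract describes.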

Paper Structure

This paper contains 13 sections, 3 equations, 5 figures.

Figures (5)

  • Figure 1: ATO predicts downstream residual stream vector. Using an SAE, we identify activated features. True and predicted residuals are projected onto SAE decoder vectors and compared.
  • Figure 2: Per-feature $R^2$ of operators depends on both the target layer depth and the leap size $k$.
  • Figure 3: Average per-feature $R^2$ for all source-target combinations. Note that the sets of chosen SAE features are different across target layers, hence values in the same column may not be directly comparable. Constant leap sizes $k$ are represented by the diagonals.
  • Figure 4: Transport efficiency for the target layer 10 with different leap ($k$) values.
  • Figure 5: Log-perplexity for unedited and ablated models. Five positions per sequence are ablated.