Activation Transport Operators

Andrzej Szablewski; Marek Masiak

TL;DR

Activation Transport Operators (ATOs) are explicit, regularised linear maps that predict downstream residuals from upstream residuals. They test for local linear transport in decoder-only transformers, with predictions evaluated in Sparse Autoencoder (SAE) feature space. The authors introduce transport efficiency and derive an upper bound on the predictable variance, $R^2_{\text{ceiling}}$, from canonical correlations, linking this efficiency to the size of the Linear Transport Subspace (LTS). Empirically, on a Gemma 2 2B model with SAEs, they show that most linear transport occurs over short distances and in earlier layers, and that it diminishes with larger leaps and greater depth. ATOs cause only small increases in perplexity, which enables targeted diagnostics and potentially low-cost edits. The results provide a compute-light probe of linear channels in LLMs and suggest avenues for studying attention-mediated routing and feature-targeted interventions.
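The canonical correlations mentioned above can be computed with a standard QR-plus-SVD recipe. The following is a minimal sketch on synthetic data, not the authors' code: the function name `canonical_correlations`, the array shapes, and the toy data are all assumptions made for illustration. The exact formula that turns these correlations into $R^2_{\text{ceiling}}$ is not stated in this summary, so the sketch stops at the correlations themselves.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between column spaces of X (n, p) and Y (n, q).

    Assumes n > max(p, q) and that both matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Thin QR gives orthonormal bases for the centred column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations (descending).
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Synthetic stand-ins for upstream (X) and downstream (Y) activations.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 8))
Y_linear = X[:, :3] @ rng.standard_normal((3, 3))   # exactly linear in X
Y_unrelated = rng.standard_normal((2000, 3))        # independent of X
Y = np.hstack([Y_linear, Y_unrelated])

rho = canonical_correlations(X, Y)
print(np.round(rho, 3))
```

In this toy setup, three directions of `Y` are exactly linear in `X`, so three canonical correlations sit near 1 while the rest stay close to 0. The count of near-unit correlations gives a rough sense of how many downstream directions any linear map could hope to predict.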

Abstract

The residual stream mediates communication between transformer decoder layers via linear reads and writes of non-linear computations. While methods based on sparse dictionary learning locate features in the residual stream, and activation patching methods discover circuits within the model, the mechanism by which features flow through the residual stream remains understudied. Understanding this dynamic can better inform jailbreaking protections and enable early detection and correction of model mistakes. In this work, we propose Activation Transport Operators (ATOs), linear maps from upstream residuals to downstream residuals $k$ layers later, evaluated in feature space using downstream SAE decoder projections. We empirically demonstrate that these operators can determine whether a feature has been linearly transported from a previous layer or synthesised by non-linear layer computation. We develop the notion of transport efficiency, for which we provide an upper bound, and use it to estimate the size of the residual stream subspace that corresponds to linear transport. We report both the transport efficiency and the size of this subspace. This compute-light method (no finetuning, <50 GPU-h) offers practical tools for safety and debugging, and a clearer picture of where computation in LLMs behaves linearly.
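To make the idea of a regularised linear map evaluated along decoder directions concrete, here is a minimal sketch on synthetic data. It is an illustration under stated assumptions, not the authors' implementation: ridge regression stands in for the unspecified regulariser, the helper names `fit_ridge_map` and `feature_r2` are hypothetical, and two random orthonormal vectors stand in for SAE decoder directions.

```python
import numpy as np

def fit_ridge_map(X_up, Y_down, lam=1e-2):
    """Closed-form ridge solution T such that X_up @ T approximates Y_down.

    No intercept is fitted; the toy data below is zero-mean by construction.
    """
    d = X_up.shape[1]
    return np.linalg.solve(X_up.T @ X_up + lam * np.eye(d), X_up.T @ Y_down)

def feature_r2(Y_true, Y_pred, decoder_dirs):
    """Per-feature R^2 after projecting residuals onto decoder directions."""
    z_true = Y_true @ decoder_dirs.T
    z_pred = Y_pred @ decoder_dirs.T
    ss_res = ((z_true - z_pred) ** 2).sum(axis=0)
    ss_tot = ((z_true - z_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d = 4000, 16
X = rng.standard_normal((n, d))                 # toy "upstream residuals"

# Two orthonormal directions standing in for SAE decoder vectors (assumption).
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
dirs = U[:, :2].T                               # shape (2, d)

linear_part = X @ rng.standard_normal(d)        # linearly transported signal
nonlinear_part = X[:, 0] * X[:, 1]              # signal no linear map can predict
Y = np.outer(linear_part, dirs[0]) + np.outer(nonlinear_part, dirs[1])

# Fit on the first 3000 samples, evaluate on the held-out 1000.
T = fit_ridge_map(X[:3000], Y[:3000])
r2 = feature_r2(Y[3000:], X[3000:] @ T, dirs)
print(np.round(r2, 3))
```

In this toy setup the first direction carries a signal that is linear in the upstream activations, so its held-out $R^2$ is close to 1, while the second carries a product term and scores near 0. That contrast mirrors the distinction drawn above between a feature that is linearly transported and one that is synthesised by non-linear computation.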
