
From independent patches to coordinated attention: Controlling information flow in vision transformers

Kieran A. Murphy

TL;DR

This work introduces a training-time, per-head variational information bottleneck in vision transformers to explicitly control information flow from attention to the residual stream. By tuning a single parameter $\beta$, the model spans a spectrum from independent patch processing to fully expressive global attention, enabling tractable mechanistic analysis of how global representations emerge from local patches. Key findings show that early, high-information-efficiency messages often encode patch repetition or redundancy, and that a sparse set of heads becomes active under strong information restriction, providing a concrete, interpretable view of emergent transformer circuits. The approach offers a principled laboratory for studying information routing in vision transformers and holds promise for improved interpretability and controllability in practical vision systems.

Abstract

We make the information transmitted by attention an explicit, measurable quantity in vision transformers. By inserting variational information bottlenecks on all attention-mediated writes to the residual stream -- without other architectural changes -- we train models with an explicit information cost and obtain a controllable spectrum from independent patch processing to fully expressive global attention. On ImageNet-100, we characterize how classification behavior and information routing evolve across this spectrum, and provide initial insights into how global visual representations emerge from local patch processing by analyzing the first attention heads that transmit information. By biasing learning toward solutions with constrained internal communication, our approach yields models that are more tractable for mechanistic analysis and more amenable to control.
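The information cost described in the abstract can be illustrated with a small sketch. It assumes each attention-head write is modeled as a diagonal Gaussian with a standard-normal prior, a common variational information bottleneck choice; the text above does not specify the exact parameterization. The function names and the `head_stats` layout are hypothetical, chosen only for illustration.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def sample_message(mu, log_var, rng):
    """Reparameterized sample mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def bottlenecked_loss(task_loss, head_stats, beta):
    """Return (task_loss + beta * total_kl, total_kl).

    head_stats is an assumed layout: a list of (mu, log_var) pairs,
    one per attention-mediated write to the residual stream.
    """
    total_kl = sum(gaussian_kl(mu, lv) for mu, lv in head_stats)
    return task_loss + beta * total_kl, total_kl


# Example: two writes, one carrying information and one silent.
rng = random.Random(0)
stats = [([1.0, 0.0], [0.0, 0.0]), ([0.0], [0.0])]
noisy_write = sample_message(*stats[0], rng)   # what would reach the residual stream
loss, total_kl = bottlenecked_loss(1.0, stats, beta=2.0)
```

Under these assumptions, a write whose mean is zero and whose variance matches the prior costs nothing, so a large $\beta$ pushes heads toward transmitting no information, while $\beta \to 0$ recovers the unpenalized task loss.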

Paper Structure (21 sections, 6 equations, 12 figures)

Figures (12)

  • Figure 1: Limiting information written by attention to the residual stream. (a) We install an information bottleneck (IB) after every attention head and before anything is written to a patch's residual stream. The sum total of information penalties is added to the original training loss, scaled by a parameter $\beta$ that induces a spectrum from independent patch voting (no attention-mediated communication) to free-flowing information as in an unmodified ViT. (b) By installing IBs after each of the 36 attention heads in a ViT-tiny trained on a subset of ImageNet, we obtain a spectrum of models parameterized by the total amount of information written to the residual stream by attention heads. As the amount of information increases, so too does accuracy, smoothly covering the span between the ViT without attention (left) and the unmodified ViT (right). Error bars are shown only for the stochastic IB models (standard deviation across 10 validation runs). (c) The voting behavior of patches in a single image also smoothly interpolates between the two extremes as a function of $\beta$. We measure the average range of patch logits (min to max) and the variety of top-assigned classes across patches, using the inverse Simpson index as an effective count of diversity. Every dot represents an image from the dataset (colored by $\beta$ with the same mapping as in (b)), and the median with interquartile ranges is shown for the point clouds with significant overlap.
  • Figure 2: KL allocation across patches. For a random sample of validation images that ViT correctly classified, we show the total KL per patch (summed across all attention heads in the model) for a selection of models nearest to the Pareto front in Fig. 1. Note that the colormaps have the same range for each model (column).
  • Figure 3: Attention head firing patterns. Survival function $P(\text{KL}\ge x)$ of per-patch information across attention heads, showing that only a small fraction of patches carry significant routed information in the high-$\beta$ regime. The grids on the right show which heads are active, starting with block 0 at the bottom.
  • Figure 4: Extreme activating patches. For the single active attention head in the $\beta=10$ model (head 11.0), and for one of the four effectual heads of the $\beta=4$ model (head 10.0), we randomly sample positive and negative activations from patches in the top 0.1% of KL cost. The patch is indicated by a red square in both the original image (top of each pair of rows) and the head's attention map (bottom of each pair of rows).
  • Figure 5: Probing the repetition hypothesis. (a) Given a patch and a base image, we randomly copy the given patch to $N$ other locations in the image before passing the augmented image through the model. For head 11.0 of the $\beta=10$ model, the repetition of the patch systematically drives the attention head's update representation to larger negative values. The corresponding attention maps highlight the copy locations. (b) A random selection of 1024 patches in the dataset was augmented with patch copies as in (a), at three different magnitudes and for three different attention heads (from the $\beta=10$ and $\beta=4$ models). We display the base image's representation mean, $\mu_0$, and the signed displacement in representation space, $\mu_\text{aug}-\mu_0$, caused by the augmentation.
  • ...and 7 more figures
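
Two of the summary statistics named in the captions above, the inverse Simpson index (Figure 1c) and the survival function $P(\text{KL}\ge x)$ (Figure 3), have simple standard definitions. The sketch below shows those definitions on toy inputs; the function names and the example values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def inverse_simpson(labels):
    """Effective number of distinct classes: 1 / sum_i p_i^2,
    where p_i is the fraction of patches voting for class i."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


def survival(values, x):
    """Empirical survival function: fraction of values that are >= x."""
    return sum(v >= x for v in values) / len(values)


# Toy example: top-assigned class per patch for two hypothetical images.
unanimous = inverse_simpson([3, 3, 3, 3])      # all patches agree
split = inverse_simpson([0, 1, 2, 3])          # every patch votes differently

# Toy per-patch KL values (nats) for a single head.
frac_active = survival([0.0, 0.1, 2.0, 5.0], 1.0)
```

The inverse Simpson index equals 1 when all patches agree and equals the number of classes when votes are spread uniformly, which is why it reads as an effective count of diversity.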