From independent patches to coordinated attention: Controlling information flow in vision transformers
Kieran A. Murphy
TL;DR
This work introduces a per-head variational information bottleneck, applied at training time, that explicitly controls the flow of information from attention to the residual stream in vision transformers. Tuning a single parameter $\beta$ moves the model along a spectrum from independent patch processing to fully expressive global attention, which makes it tractable to analyze mechanistically how global representations emerge from local patches. Two findings stand out: early messages with high information efficiency often encode patch repetition or redundancy, and a sparse set of heads becomes active under strong information restriction, giving a concrete, interpretable view of emergent transformer circuits. The approach provides a principled laboratory for studying information routing in vision transformers and may improve interpretability and controllability in practical vision systems.
Abstract
We make the information transmitted by attention an explicit, measurable quantity in vision transformers. By inserting variational information bottlenecks on all attention-mediated writes to the residual stream -- without other architectural changes -- we train models with an explicit information cost and obtain a controllable spectrum from independent patch processing to fully expressive global attention. On ImageNet-100, we characterize how classification behavior and information routing evolve across this spectrum, and provide initial insights into how global visual representations emerge from local patch processing by analyzing the first attention heads that transmit information. By biasing learning toward solutions with constrained internal communication, our approach yields models that are more tractable for mechanistic analysis and more amenable to control.
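To make the idea of an "explicit information cost" concrete, the following is a minimal sketch of a standard Gaussian variational information bottleneck applied to one head's write. It is not taken from the paper: the diagonal-Gaussian parametrization, the standard-normal prior, the function names, and the convention that larger $\beta$ means stronger restriction are all assumptions made for illustration.

```python
import math
import random


def gaussian_kl_nats(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in nats.

    In a VIB this term upper-bounds the information a message carries,
    so it can serve as the per-head information cost (assumed form).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def bottlenecked_write(mu, logvar, rng):
    """Sample the noisy message a head would add to the residual stream."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def total_loss(task_loss, head_kls, beta):
    """Task loss plus beta times the summed information cost of all heads."""
    return task_loss + beta * sum(head_kls)


rng = random.Random(0)

# Hypothetical (mean, log-variance) outputs of two heads for a single patch.
heads = {
    # Matches the prior exactly, so it transmits nothing and costs 0 nats.
    "silent_head": ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]),
    # Low-variance, nonzero-mean message: informative, so it pays a cost.
    "active_head": ([1.5, -0.5, 0.2], [-2.0, -2.0, -2.0]),
}

kls = {name: gaussian_kl_nats(mu, lv) for name, (mu, lv) in heads.items()}
messages = {name: bottlenecked_write(mu, lv, rng) for name, (mu, lv) in heads.items()}

for name in heads:
    print(f"{name}: cost = {kls[name]:.3f} nats, message = {messages[name]}")

# Under the assumed convention, a larger beta penalizes communication more.
for beta in (0.0, 0.1, 1.0):
    print(f"beta = {beta}: loss = {total_loss(1.0, kls.values(), beta):.3f}")
```

In this sketch a head can avoid the penalty only by emitting a message indistinguishable from the prior, which is one way a model could be pushed toward independent patch processing as the penalty grows.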
