Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs
Lianghuan Huang, Yingshan Chang
TL;DR
This paper examines how decodability and causality diverge in vision transformers (ViTs) fine-tuned for object counting. It combines activation patching (to assess causal influence) with linear probing (to assess decodability) across layers and token types. The key finding is a systematic mismatch: middle-layer object tokens causally influence predictions despite being weakly decodable, final-layer object tokens are highly decodable yet causally inert, and the CLS token becomes decodable in the middle layers but gains causal power only in the final layers. Information that is present in a representation is therefore not necessarily information the model uses, so both analyses are needed to uncover hidden computations in ViTs.
Abstract
Mechanistic interpretability seeks to uncover how internal components of neural networks give rise to predictions. A persistent challenge, however, is disentangling two often conflated notions: decodability (the recoverability of information from hidden states) and causality (the extent to which those states functionally influence outputs). In this work, we investigate their relationship in vision transformers (ViTs) fine-tuned for object counting. Using activation patching, we test the causal role of spatial and CLS tokens by transplanting activations between the clean and corrupted images of each pair. In parallel, we train linear probes to assess the decodability of count information at different depths. Our results reveal systematic mismatches: middle-layer object tokens exert strong causal influence despite being weakly decodable, whereas final-layer object tokens support accurate decoding yet are functionally inert. Similarly, the CLS token becomes decodable in the middle layers but acquires causal power only in the final layers. These findings highlight that decodability and causality reflect complementary dimensions of representation (what information is present versus what is used) and that their divergence can expose hidden computational circuits.
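To make the two measurements concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hypothetical toy model (two per-token tanh layers, mean pooling, and a linear readout) in place of a ViT, a normalized restoration score as the patching metric, and a least-squares regression probe scored by in-sample R^2; the names `forward`, `patching_effect`, and `probe_r2` are illustrative, and the paper's actual metric and probe setup may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden width of the toy model (illustrative)

# Hypothetical stand-in for a network: two per-token tanh layers, then
# mean pooling and a linear readout. Not a ViT; tokens do not interact.
WEIGHTS = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(2)]
W_OUT = rng.normal(size=D)


def forward(x, patch=None):
    """Run the toy model on tokens x of shape (T, D).

    patch = (layer, token, vector) overwrites one hidden state in flight.
    Returns the scalar output and the list of per-layer hidden states.
    """
    hiddens = []
    h = x
    for layer, W in enumerate(WEIGHTS):
        h = np.tanh(h @ W)
        if patch is not None and patch[0] == layer:
            h = h.copy()
            h[patch[1]] = patch[2]
        hiddens.append(h)
    return float(h.mean(axis=0) @ W_OUT), hiddens


def patching_effect(clean_x, corrupt_x, layer, token):
    """Fraction of the clean-vs-corrupted output gap restored by
    transplanting one clean hidden state into the corrupted run."""
    y_clean, h_clean = forward(clean_x)
    y_corrupt, _ = forward(corrupt_x)
    y_patched, _ = forward(corrupt_x, patch=(layer, token, h_clean[layer][token]))
    return (y_patched - y_corrupt) / (y_clean - y_corrupt)


def probe_r2(features, targets):
    """Fit a linear probe (with bias) by least squares; return in-sample R^2.
    A real study would score the probe on held-out data."""
    X = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    residual = targets - X @ w
    return 1.0 - residual.var() / targets.var()


# Causality: corrupt only token 0, then patch a changed vs. an unchanged token.
clean = rng.normal(size=(4, D))
corrupt = clean.copy()
corrupt[0] = rng.normal(size=D)
effect_changed = patching_effect(clean, corrupt, layer=0, token=0)
effect_unchanged = patching_effect(clean, corrupt, layer=0, token=1)

# Decodability: probe features whose target is linearly encoded by design.
features = rng.normal(size=(50, D))
targets = features @ rng.normal(size=D) + 3.0
probe_score = probe_r2(features, targets)

print(effect_changed, effect_unchanged, probe_score)
```

Because tokens in this toy model never interact, patching the corrupted token restores the clean output and patching an untouched token changes nothing, so it cannot reproduce the mismatches reported above. Its purpose is only to show that the two scores are computed independently: one by intervening on a hidden state, the other by fitting a readout to it.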
