Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger
TL;DR
The paper investigates how recurrent networks store and generate sequences. It challenges the strong Linear Representation Hypothesis (LRH) by showing that small GRUs encode sequence position as layered magnitudes rather than directions, forming "onion representations". Through a series of intervention-based analyses (unigram, bigram, and onion probes), it demonstrates that linear, direction-based encodings emerge only in larger models, while smaller models rely on non-linear, magnitude-based structures that require autoregression to access stored tokens. The work presents onion representations as a counterexample to the strong LRH and argues for expanding causal testing beyond linear subspaces. These findings suggest that interpretability research may miss mechanisms that lie outside linear frameworks, especially in memory- and sequence-centric tasks.
Abstract
The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.
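The magnitude-based encoding described above can be illustrated with a toy sketch. This is not the trained GRU or the paper's intervention code; it is a hand-built vector under stated assumptions: orthonormal token embeddings, a fixed scale factor of 0.1 per position, and hypothetical helper names (`make_embeddings`, `onion_encode`, `onion_decode`). It shows why such a representation has to be read out autoregressively: the token at a later position only becomes visible after the larger-magnitude layers in front of it have been subtracted.

```python
import numpy as np


def make_embeddings(vocab_size, seed=0):
    """Return a (vocab_size, vocab_size) matrix with orthonormal rows.

    Assumption for illustration: one orthonormal vector per token.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(vocab_size, vocab_size)))
    return q


def onion_encode(tokens, emb, scale=0.1):
    """Store a sequence in one vector: position p is scaled by scale**p."""
    h = np.zeros(emb.shape[1])
    for pos, tok in enumerate(tokens):
        h += (scale ** pos) * emb[tok]
    return h


def onion_decode(h, emb, length, scale=0.1):
    """Peel the layers off one at a time, outermost (largest) first."""
    residual = h.copy()
    decoded = []
    for pos in range(length):
        # The largest remaining layer dominates the dot products.
        tok = int(np.argmax(emb @ residual))
        decoded.append(tok)
        # Remove this layer so the next, smaller one can be read.
        residual -= (scale ** pos) * emb[tok]
    return decoded


emb = make_embeddings(vocab_size=8)
sequence = [3, 3, 1, 0, 7]  # a repeated token shares one direction
hidden = onion_encode(sequence, emb)
print(onion_decode(hidden, emb, len(sequence)))
```

In this sketch the two occurrences of token 3 lie along the same direction and differ only in magnitude, so no linear subspace separates "position 0" from "position 1"; the positions are distinguished by scale alone.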
