Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger

TL;DR

The paper investigates how recurrent networks store and generate sequences. It challenges the strong Linear Representation Hypothesis (LRH) by showing that small GRUs encode sequence position as layered magnitudes rather than directions, forming "onion" representations. Using intervention-based analyses (unigram, bigram, and onion probes), it shows that linear, direction-based encodings emerge only in larger models, while smaller models rely on non-linear, magnitude-based structures that require autoregression to access stored tokens. The work presents onion representations as a counterexample to the strong LRH and argues for expanding causal testing beyond linear subspaces. For interpretability research, the findings suggest that many mechanistic insights may lie outside traditional linear frameworks, especially in memory- and sequence-centric tasks.

Abstract

The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.

Paper Structure (45 sections, 11 equations, 5 figures, 5 tables)

This paper contains 45 sections, 11 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: We find that GRUs solve a repeat task by learning a scaling factor corresponding to each sequence position, leading to layered onion-like representations. In this simplified illustration, the learned token embeddings (a) are rescaled to have magnitudes proportional to their sequence positions (b). To change an element of the sequence, remove (c) and replace (d) the token embedding at the given positional magnitude. The layered nature of the representations makes them non-linear; any direction will cross-cut multiple layers of the onion.
  • Figure 2: The input gate ${\bm{z}}_t$ in GRUs learning different representations. Yellow is open; dark blue is closed; the $y$-axis is the channel; the $x$-axis is the position. Both models use input gates to let in different proportions of each dimension across the sequence in order to store the positions of the input tokens. The large model (left) sharply turns off individual channels to mark position; in contrast, the small model (right) gradually turns off all channels.
  • Figure 3: The intervention described by Equations \ref{eq:scaledintstart}--\ref{eq:scaledintend}, where the input sequence is $(a,b,c,d)$ and the intervention is to fix the second position to be the token $c$.
  • Figure 4: Accuracy of different probes on the final representation ${\bm{h}}_L$ of GRUs with $N=64$ and autoregressive input (mean of 5 runs; $\pm$ 1 s.d.). Only the probes that use autoregressive denoising can successfully decode the sequence.
  • Figure 5: All 1024 channels of the GRU gate ${\bm{z}}_t$ shown in Figure \ref{fig:first_64_of_1024}. All channels follow similar patterns.
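
The layered storage in Figure 1 and the remove-and-replace intervention in Figure 3 can be illustrated with a toy sketch. This is not the trained GRU or the paper's learned intervention. It is a hand-built example under stated assumptions: the vocabulary size `VOCAB`, the geometric scaling factor `GAMMA`, the circular embeddings `EMB`, and the helper names `encode`, `decode`, and `intervene` are all hypothetical. The choice that earlier positions get larger magnitudes is likewise an assumption made for illustration.

```python
import numpy as np

VOCAB = 8      # number of distinct tokens (assumed for this toy example)
GAMMA = 0.1    # per-position scaling factor (assumed; the paper learns its scales)

# Toy token embeddings: VOCAB points on the unit circle, so an entire
# sequence is stored in only two dimensions.
angles = 2 * np.pi * np.arange(VOCAB) / VOCAB
EMB = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (VOCAB, 2)


def encode(tokens):
    """Sum the token embeddings, each scaled by GAMMA**position."""
    scales = GAMMA ** np.arange(len(tokens))
    return (scales[:, None] * EMB[tokens]).sum(axis=0)


def decode(h, length):
    """Peel the onion: read the largest layer, subtract it, repeat."""
    residual = np.array(h, dtype=float)
    tokens = []
    for pos in range(length):
        scale = GAMMA ** pos
        # After rescaling, the nearest embedding is the token stored at `pos`;
        # all later positions only contribute a small perturbation.
        dists = np.linalg.norm(EMB - residual / scale, axis=1)
        tok = int(np.argmin(dists))
        tokens.append(tok)
        # Later positions can only be read once this layer is removed.
        residual = residual - scale * EMB[tok]
    return tokens


def intervene(h, pos, old_token, new_token):
    """Remove old_token's layer at `pos` and insert new_token at the same scale."""
    scale = GAMMA ** pos
    return h - scale * EMB[old_token] + scale * EMB[new_token]


seq = [0, 1, 2, 3]                                       # stands in for (a, b, c, d)
h = encode(seq)
h_fixed = intervene(h, pos=1, old_token=1, new_token=2)  # second position -> c

print(decode(h, len(seq)))        # [0, 1, 2, 3]
print(decode(h_fixed, len(seq)))  # [0, 2, 2, 3]
```

In this sketch every position shares the same two dimensions, so no direction isolates a single position. A token can only be read by first decoding and subtracting all larger layers, which mirrors the observation in Figure 4 that only probes with autoregressive denoising decode the sequence.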