Uncovering the Computational Roles of Nonlinearity in Sequence Modeling Using Almost-Linear RNNs

Manuel Brenner, Georgia Koppe

TL;DR

The paper introduces Almost-Linear RNNs (AL-RNNs) to systematically study when nonlinear recurrence is truly necessary for sequence modeling. By controlling the number of nonlinear units $P$, and thereby partitioning the hidden state space into $2^P$ linear subregions via piecewise linear units, the work reveals how memory can be implemented through slow linear modes while nonlinear switches enable gating, routing, and regime changes. Across a diverse set of tasks, including sentiment and image/audio sequence integration, memory recall, gating in the addition problem, contextual integration, and SCAN, the authors show that sparse nonlinearity greatly improves interpretability and sample efficiency, enabling reusable nonlinear motifs in multi-task settings. In contrast, fully nonlinear models often underperform or train less robustly, highlighting the inductive benefit of sparsity. The results provide a principled guideline for balancing performance, efficiency, and interpretability in recurrent architectures and offer a framework for analyzing latent dynamics in both artificial and biological systems.

Abstract

Sequence modeling tasks across domains such as natural language processing, time series forecasting, and control require learning complex input-output mappings. Nonlinear recurrence is theoretically required for universal approximation of sequence-to-sequence functions, yet linear recurrent models often prove surprisingly effective. This raises the question of when nonlinearity is truly required. We present a framework to systematically dissect the functional role of nonlinearity in recurrent networks, identifying when it is computationally necessary and what mechanisms it enables. We address this using Almost Linear Recurrent Neural Networks (AL-RNNs), which allow recurrence nonlinearity to be gradually attenuated and decompose network dynamics into analyzable linear regimes, making computational mechanisms explicit. We illustrate the framework across diverse synthetic and real-world tasks, including classic sequence modeling benchmarks, a neuroscientific stimulus-selection task, and a multi-task suite. We demonstrate how the AL-RNN's piecewise linear structure enables identification of computational primitives such as gating, rule-based integration, and memory-dependent transients, revealing that these operations emerge within predominantly linear backbones. Across tasks, sparse nonlinearity improves interpretability by reducing and localizing nonlinear computations, promotes shared representations in multi-task settings, and reduces computational cost. Moreover, sparse nonlinearity acts as a useful inductive bias: in low-data regimes or when tasks require discrete switching between linear regimes, sparsely nonlinear models often match or exceed fully nonlinear architectures. Our findings provide a principled approach for identifying where nonlinearity is functionally necessary, guiding the design of recurrent architectures that balance performance, efficiency, and interpretability.

Paper Structure

This paper contains 56 sections, 1 theorem, 16 equations, 27 figures, 3 tables.

Key Result

Proposition 1

An AL-RNN without ReLU nonlinearity cannot solve the addition problem.
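
The following is a brief sketch of why a statement like this is plausible; it is not the paper's proof. It assumes the standard form of the addition problem, in which each time step supplies a value $u_t$ and a binary mask $m_t$ and the target is the sum of the values at the marked steps. It also assumes a purely linear recurrence with input matrix $\mathbf{C}$, bias $\mathbf{h}$, readout $\mathbf{B}$, and sequence length $N$; all of these symbols are introduced here for illustration only.

$$
\mathbf{z}_t = \mathbf{W}\mathbf{z}_{t-1} + \mathbf{C}\mathbf{s}_t + \mathbf{h},
\qquad \mathbf{s}_t = (u_t, m_t)^\top
$$

$$
\hat{y} = \mathbf{B}\mathbf{z}_N
= \mathbf{B}\mathbf{W}^{N}\mathbf{z}_0 + \sum_{t=1}^{N} \mathbf{B}\mathbf{W}^{N-t}\left(\mathbf{C}\mathbf{s}_t + \mathbf{h}\right),
\qquad
y = \sum_{t=1}^{N} m_t\, u_t
$$

Under these assumptions the output $\hat{y}$ is an affine function of the inputs, whereas the target $y$ contains the products $m_t u_t$. Already for a single step, an affine map $a u + b m + c$ would have to vanish for all $u$ when $m = 0$ (forcing $a = c = 0$) and equal $u$ when $m = 1$ (forcing $b = u$ for every $u$), which is impossible.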

Figures (27)

  • Figure 1: Illustration of the AL-RNN and bitcode assignment. The example displays a 6-dimensional AL-RNN with four linear and two PWL units (left). The PWL units partition the state space into four linear subregions (right). Bitcodes encode positive $(1)$ and negative $(0)$ activation values for these units. Each subregion corresponds to a unique bitcode (00, 01, 10, 11) and is governed by a distinct linear dynamical system (DS) with its own recurrence matrix $\mathbf{W}_{\Omega^i}$. An example trajectory (bottom) traverses a sequence of these subregions over time, experiencing discrete switches in dynamics when crossing subregion boundaries (marked by dashed vertical lines). Within each subregion, the dynamics remain linear.
  • Figure 2: Top row: Sentiment classification on IMDb reviews. a: Test accuracy (y-axis) as a function of the number of nonlinear units, $P$. b: Trajectory through latent space showing sentiment-relevant keywords guiding classification. c: The dominant first PC clearly separates positive and negative reviews. d: Bitcode frequencies are highly concentrated on a small subset of linear subregions. Bottom row: Digit classification on sequential MNIST. e: Test accuracy (y-axis) as a function of the number of nonlinear units (x-axis; mean $\pm$ std over 10 seeds). f: Projection of the final latent state onto the first 3 PCs shows that nonlinearity partitions latent space by class.
  • Figure 3: a: Structure of the copy task and two example trajectories for $P=1$. b: Top: Activity of the PWL unit ($P=1$) for the input sequences in a. The latent activity follows a complex limit cycle primarily located in one linear subregion (PWL unit negative), which switches to the second subregion during decoding (PWL unit positive). Bottom: Autonomous activity of the AL-RNN in the absence of inputs encodes a 100-cycle, with its transient located only in one subregion (PWL unit negative). c: Symbol-wise recall accuracy (mean $\pm$ std over 10 seeds) vs. number of PWL units $P$. d: Histogram of binary "bitcodes" during recall for $P=10$, concentrated on a small subset of the $2^{10}=1024$ possible bitcodes. e: Explained variance ratio of PCs of latent network activity (for $P=1$) indicates relatively high-dimensional, complex dynamics.
  • Figure 4: a: Performance as a function of the number of PWL units $P$ (mean $\pm$ std over 10 seeds). b: Latent trajectories in PWL space with $P=2$ nonlinear units. The second linear subregion is selectively activated only at the two masked time points ("input"), indicated by sharp transitions across quadrant boundaries. Outside these events, the network remains in a single linear regime, leading to smooth integration of the cumulative sum. c: Time course of masked inputs (red), network outputs (grey/black), and target values (blue) for two example trials, where the input occurs either early (light grey) or late (black) in the sequence. In both cases, the trajectory initially follows a linear drift prior to input, then transitions sharply to an elevated integration path with the same slope, offset so as to reach the target after 100 steps.
  • Figure 5: AL-RNN training on a multi-task paradigm. (a) Test accuracy (mean across 5 models per setting) across 11 cognitive tasks as a function of $P$ (PWL units), for different training set sizes. For limited data (20 samples per task), sparse nonlinearity ($P=1$ to $P=8$) outperforms both linear and fully nonlinear models by reusing structures across tasks. For medium data (50 samples per task), more nonlinearity provides additional benefit, while sparse nonlinearity ($P=4$ to $P=16$) performs best. With abundant data (500 samples per task, right), this advantage diminishes as fully nonlinear models can afford task-specific solutions, though sparsely nonlinear architectures remain competitive. (b) Task similarity via Jensen-Shannon divergence between bitcode distributions (mean across 5 independent training runs). Sparse nonlinearity ($P=2, 4$) reveals interpretable structure where related tasks share subregions: for instance, pro variants cluster together while anti variants form a separate group. As nonlinearity increases, this structure vanishes because the model has sufficient capacity to learn essentially independent representations for each task, reducing overlap and eliminating the interpretable similarity structure visible at lower $P$.
  • ...and 22 more figures
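
The bitcode construction described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It uses the caption's example sizes (6 units, of which 2 are PWL units). The parameterisation is an assumption made here for illustration: a diagonal matrix `A`, a full matrix `W`, a bias `h`, no external input, and ReLU applied only to the last `P` units. The function names are hypothetical, and the weights are random rather than trained.

```python
import itertools
import numpy as np

def al_rnn_step(z, A, W, h, P):
    """One autonomous update; ReLU acts on the last P units only (assumed form)."""
    M = z.shape[0]
    phi = z.copy()
    phi[M - P:] = np.maximum(phi[M - P:], 0.0)
    return A @ z + W @ phi + h

def bitcode(z, P):
    """1 for a positive PWL unit, 0 otherwise, as in the Figure 1 caption."""
    M = z.shape[0]
    return tuple(int(v > 0) for v in z[M - P:])

def subregion_matrix(A, W, bits):
    """Linear recurrence matrix that governs the subregion with this bitcode."""
    M, P = A.shape[0], len(bits)
    d = np.ones(M)            # linear units always pass through unchanged
    d[M - P:] = bits          # PWL units pass through only when positive
    return A + W @ np.diag(d)

rng = np.random.default_rng(0)
M, P = 6, 2                                   # sizes from the Figure 1 example
A = np.diag(rng.uniform(0.5, 0.85, size=M))   # random, untrained parameters
W = 0.02 * rng.standard_normal((M, M))
h = 0.1 * rng.standard_normal(M)

# All 2^P possible bitcodes: (0, 0), (0, 1), (1, 0), (1, 1)
all_codes = list(itertools.product((0, 1), repeat=P))

z = rng.standard_normal(M)
visited, max_gap = [], 0.0
for _ in range(50):
    bits = bitcode(z, P)
    visited.append(bits)
    z_next = al_rnn_step(z, A, W, h, P)
    # Inside a subregion the update coincides with a purely linear map.
    z_lin = subregion_matrix(A, W, bits) @ z + h
    max_gap = max(max_gap, float(np.max(np.abs(z_next - z_lin))))
    z = z_next

print(len(all_codes), sorted(set(visited)), max_gap)
```

Recording `visited` over many trajectories gives the kind of bitcode histogram mentioned in the figure captions; in this sketch it only shows which of the four subregions a random, untrained model happens to enter.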

Theorems & Definitions (2)

  • Proposition 1
  • proof