Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling

Shota Takashiro, Masanori Koyama, Takeru Miyato, Yusuke Iwasawa, Yutaka Matsuo, Kohei Hayashi

Abstract

We extend recent latent recurrent modeling to sequential input streams. By interleaving fast, self-organizing recurrent latent updates between slow observation updates, our method facilitates the learning of stable internal structures that evolve alongside the input. This mechanism allows the model to maintain coherent, clustered representations over long horizons, improving out-of-distribution generalization on reinforcement learning and algorithmic tasks compared with sequential baselines such as LSTMs, state space models, and Transformer variants.
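To make the fast-slow mechanism concrete, here is a minimal, hypothetical PyTorch sketch of one observation step: a slow update driven by the new observation, followed by several fast, input-free latent updates. The class name FSRMCell, the GRUCell building blocks, and the number of fast steps are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical fast-slow recurrent cell: names and GRUCell choice are assumptions.
import torch
import torch.nn as nn


class FSRMCell(nn.Module):
    """One slow observation update followed by several fast latent updates."""

    def __init__(self, obs_dim: int, latent_dim: int, n_fast_steps: int = 4):
        super().__init__()
        self.slow = nn.GRUCell(obs_dim, latent_dim)     # slow: driven by the observation
        self.fast = nn.GRUCell(latent_dim, latent_dim)  # fast: latent-only "thinking" steps
        self.n_fast_steps = n_fast_steps

    def forward(self, obs: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Slow update: incorporate the new observation into the latent state.
        h = self.slow(obs, h)
        # Fast updates: refine the latent state with no new input.
        for _ in range(self.n_fast_steps):
            h = self.fast(h, h)
        return h


def rollout(cell: FSRMCell, observations: torch.Tensor) -> torch.Tensor:
    """Unroll the cell over a (T, B, obs_dim) observation stream."""
    T, B, _ = observations.shape
    h = observations.new_zeros(B, cell.slow.hidden_size)
    states = []
    for t in range(T):
        h = cell(observations[t], h)
        states.append(h)
    return torch.stack(states)  # (T, B, latent_dim)
```

The key design point this sketch illustrates is that the fast loop runs several latent updates per observation, so the hidden state can reorganize between inputs rather than changing only once per time step.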

Paper Structure

This paper contains 40 sections, 6 equations, 17 figures, and 8 tables.

Figures (17)

  • Figure 1: Token-wise accuracy vs. sequence length on the Dyck-$(30,5)$ task. Frontier LLMs are prompted with the ground-truth generation algorithm of the Dyck language in text form, so that they only need to execute the plan described in the prompt; see Appendix \ref{app:dyck-details} for the exact prompt and the evaluation protocol. Their performance drops rapidly as sequence length increases, consistent with the behavior reported in shojaee2025illusion2 and sinha2025illusion. Our model (FSRM), in contrast, is trained exclusively on raw strings of short length (shaded range, $\leq 40$). It maintains approximately $90\%$ accuracy over sequence lengths spanning four orders of magnitude beyond the training range.
  • Figure 2: Comparison of architectures. Transformers compute dense pairwise interactions in a single pass, while iterative variants such as looped transformers (fan2024looped) repeatedly update the representations through a recurrent layer. In contrast, RNNs/SSMs update hidden states strictly along the time axis. Our model (FSRM) performs multiple recurrent updates within each observation interval.
  • Figure 3: Egocentric maze examples. Models are trained on small mazes (a) and evaluated on larger mazes (b). The green cell denotes the start, the red cell denotes the goal, and the observation is always limited to a $7\times 7$ region centered on the agent's current position.
  • Figure 4: Accuracy comparison of our model (FSRM) with baselines on the maze task. Our model (FSRM) shows substantially better OOD generalization.
  • Figure 5: Dyck examples. (a) The target at position $s$ is the token that closes the most recent unclosed bracket in the prefix up to $s$; when the stack is empty, "$*$" is predicted (a minimal stack-based sketch of this rule follows the figure list). (b) In a $1$-regular run, the predictor must output the token that closes the first open bracket (e.g., "$[$") at every odd step, while remembering this unresolved bracket for as long as the sequence continues.
  • ...and 12 more figures
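For reference, the target rule described in Figure 5(a) can be stated as a short stack routine: push the matching closer on every opening bracket, pop on the matching closing bracket, and predict the top of the stack, or "$*$" when the stack is empty. The sketch below is a hedged illustration using three bracket types; the function name and bracket set are assumptions, and the actual Dyck-$(30,5)$ task's bracket vocabulary may differ.

```python
# Illustrative target generator for the rule in Figure 5(a); names and
# the three-type bracket set are assumptions, not the paper's task spec.
CLOSER = {"(": ")", "[": "]", "{": "}"}


def dyck_targets(sequence: str) -> list:
    """Return the per-position target token for a bracket sequence."""
    stack, targets = [], []
    for token in sequence:
        if token in CLOSER:                  # opening bracket: push its closer
            stack.append(CLOSER[token])
        elif stack and token == stack[-1]:   # matching closing bracket: pop
            stack.pop()
        # Target: closer of the most recent unclosed bracket, or "*" if none.
        targets.append(stack[-1] if stack else "*")
    return targets


print(dyck_targets("([])"))  # [')', ']', ')', '*']
```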