
Adaptive Loops and Memory in Transformers: Think Harder or Know More?

Markus Frey, Behzad Shomali, Ali Hamza Bashir, David Berghaus, Joachim Koehler, Mehdi Ali

TL;DR

This work investigates transformer models that combine adaptive per-layer looping, in which each transformer block learns to iterate its hidden state via a learned halting mechanism, with gated memory banks that provide additional learned storage.

Abstract

Chain-of-thought (CoT) prompting enables reasoning in language models but requires explicit verbalization of intermediate steps. Looped transformers offer an alternative: they iteratively refine representations within hidden states, reusing the same weights at every iteration. This parameter efficiency comes at a cost, as looped models lack the storage capacity of deeper models that use unique weights per layer. In this work, we investigate transformer models that combine adaptive per-layer looping, in which each transformer block learns to iterate its hidden state via a learned halting mechanism, with gated memory banks that provide additional learned storage. We find that looping primarily benefits mathematical reasoning, while memory banks help recover performance on commonsense tasks compared to parameter- and FLOP-matched models. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline with three times as many layers across math benchmarks. Analysis of model internals reveals layer specialization: early layers learn to loop minimally and access memory sparingly, while later layers do both more heavily.
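
The adaptive looping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the remainder-based weighting (borrowed from Adaptive Computation Time), the toy `block_fn` and `halt_fn`, and all function names are assumptions made for illustration, and the per-step scales mentioned in Figure 1 are omitted. The sketch shows how per-step halting probabilities could be turned into a weighted combination of intermediate states, and how an expected number of loop iterations follows from the same weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def halting_weights(halt_probs):
    """Convert per-step halting probabilities into weights that sum to 1.

    Step n receives p_n * prod_{k<n}(1 - p_k); the final step absorbs the
    remaining mass (ACT-style remainder, an assumption of this sketch).
    """
    weights, remaining = [], 1.0
    for n, p in enumerate(halt_probs):
        is_last = n == len(halt_probs) - 1
        w = remaining if is_last else remaining * p
        weights.append(w)
        remaining -= w
    return weights


def looped_block(h, block_fn, halt_fn, max_steps):
    """Apply one block up to max_steps times and mix the intermediate states."""
    states, probs = [], []
    for _ in range(max_steps):
        h = block_fn(h)            # same weights reused at every iteration
        states.append(h)
        probs.append(halt_fn(h))   # hypothetical learned halting head
    weights = halting_weights(probs)
    mixed = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(len(h))]
    expected_steps = sum((n + 1) * w for n, w in enumerate(weights))
    return mixed, expected_steps


# Toy stand-ins for a transformer block and its halting head (hypothetical).
toy_block = lambda h: [0.5 * x + 1.0 for x in h]
toy_halt = lambda h: sigmoid(sum(h) / len(h) - 1.5)

mixed, expected_steps = looped_block([0.0, 0.0], toy_block, toy_halt, max_steps=3)
```

In this formulation a layer whose halting head fires early places most of its weight on the first state, so its expected step count stays close to 1, while a layer that rarely halts approaches `max_steps`.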

Paper Structure (22 sections, 9 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Architecture overview. Left: A standard transformer passes hidden states through $L$ unique blocks. Center: Our loop model allows each block to iterate up to $N$ times, with a learned halting mechanism that produces a weighted combination of intermediate states. Per-step scales $\zeta(s_n)$ are initialized near zero for training stability. Right: The combined model additionally retrieves from local (per-layer) and global (shared) memory banks, gated by learned input-dependent scalars.
  • Figure 2: Expected number of loop iterations per layer over training. Left: Each curve represents one layer. Early layers (lighter colors) consistently use fewer iterations than later layers (darker colors). Middle: Expected steps at the end of training. Right: All models show a characteristic transition that occurs at approximately the same cross-entropy value across configurations (see Figure 3 in the Appendix for all configurations).
  • Figure 3: Expected loop iterations vs. validation cross-entropy for all configurations. Each point represents one evaluation during training; curves are colored by model configuration. Across all looped models, the expected number of iterations begins to increase rapidly once the cross-entropy drops below approximately $3.27 \pm 0.59$. This phase transition is consistent across Loop-3, Loop-5, and Loop-7 configurations, suggesting it depends on the model's language competence rather than the maximum number of allowed iterations.
  • Figure 4: Memory gate activations across layers and training. Left: Local memory gate values vary widely across layers, with later layers tending toward higher activations, and the spread increases over training. Right: Global memory gate values increase during training but converge to a more uniform profile across layers, with activations rising up to approximately layer 5 and then plateauing.
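
The gated memory retrieval described in the captions of Figures 1 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the soft-attention read over key/value slots, the residual update, the dictionary layout of a bank, and every name below are hypothetical. The only elements taken from the source are that each layer has a local bank, that a global bank is shared, and that each read is gated by a learned input-dependent scalar.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(h, keys, values):
    """Soft read: attention over slots, weights = softmax(h . k_j)."""
    attn = softmax([dot(h, k) for k in keys])
    dim = len(values[0])
    return [sum(a * v[i] for a, v in zip(attn, values)) for i in range(dim)]


def gated_memory_update(h, bank, gate_w, gate_b):
    """Return h + g * read(h), where g = sigmoid(gate_w . h + gate_b) is a scalar."""
    gate = sigmoid(dot(gate_w, h) + gate_b)
    read = read_memory(h, bank["keys"], bank["values"])
    return [x + gate * r for x, r in zip(h, read)], gate


# Hypothetical banks: one local to this layer, one shared by all layers.
local_bank = {"keys": [[1.0, 0.0], [0.0, 1.0]], "values": [[0.2, 0.0], [0.0, 0.2]]}
global_bank = {"keys": [[0.0, 0.0], [0.0, 0.0]], "values": [[1.0, 0.0], [0.0, 1.0]]}

h = [1.0, 2.0]
h, local_gate = gated_memory_update(h, local_bank, gate_w=[0.1, 0.1], gate_b=-2.0)
h, global_gate = gated_memory_update(h, global_bank, gate_w=[0.0, 0.0], gate_b=0.0)
```

Under this sketch, a gate value near 0 leaves the hidden state untouched and a value near 1 adds the full memory read, so per-layer gate activations like those plotted in Figure 4 can be read as how strongly each layer relies on stored content.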