Understanding Dynamic Compute Allocation in Recurrent Transformers

Ibraheem Muhammad Moosa, Suhas Lohit, Ye Wang, Moitreya Chatterjee, Wenpeng Yin

TL;DR

This work studies token-level adaptive computation. It introduces a complexity-controlled evaluation paradigm and ANIRA, a unified recurrent Transformer that supports per-token variable-depth computation. It distinguishes two decision mechanisms, ANIRA-E (early depth allocation) and ANIRA-O (online halting), and analyzes the compute policies each learns on algorithmic and synthetic language tasks. The results show that compute allocation can align with task complexity without explicit difficulty supervision, but that this alignment does not guarantee extrapolation to unseen input sizes. The two decision modes also develop qualitatively different strategies: early allocation relies on static structural cues, whereas online halting more closely tracks algorithmic state. The study further reveals a two-phase training dynamic (learning followed by compute reduction) and discusses the interpretability implications of the different allocation policies, suggesting avenues for complexity-aware benchmarking and for inductive biases toward algorithmic structure in adaptive computation systems.

Abstract

Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.
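
One way to picture the "direct testing" that parameterized difficulty enables is to group tokens by their known complexity level and compare the mean depth allocated at each level. The sketch below is illustrative only: the function names, the `(complexity_level, allocated_depth)` record format, and the toy numbers are assumptions made for this example, not the paper's evaluation code or results.

```python
from collections import defaultdict


def mean_depth_by_complexity(records):
    """Average allocated depth per complexity level.

    `records` is an iterable of (complexity_level, allocated_depth) pairs,
    one pair per token. This record format is an assumption of the sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # level -> [depth sum, token count]
    for level, depth in records:
        totals[level][0] += depth
        totals[level][1] += 1
    return {level: s / n for level, (s, n) in sorted(totals.items())}


def is_monotone_non_decreasing(means):
    """True if mean depth never decreases as the complexity level grows."""
    values = [means[level] for level in sorted(means)]
    return all(a <= b for a, b in zip(values, values[1:]))


# Toy, made-up data: three complexity levels, two tokens each.
records = [(1, 2), (1, 3), (2, 4), (2, 4), (3, 6), (3, 7)]
means = mean_depth_by_complexity(records)
aligned = is_monotone_non_decreasing(means)
print(means, aligned)
```

A monotone trend of this kind only indicates that allocation tracks complexity on the levels tested; as the abstract notes, it says nothing about accuracy at unseen input sizes.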

Paper Structure

This paper contains 42 sections, 8 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Two variants of ANIRA: (a) ANIRA-E: depth allocation decided from a shallow (pre-recurrence) representation. (b) ANIRA-O: online halting decisions are made for each token between each recurrent layer to decide whether to continue or halt adaptive compute. Once a token has completed its allocated number of recurrent steps, its representation is frozen and subsequent iterations act as identity mappings for that token.
  • Figure 2: Task complexity vs. mean depth allocation. Both ANIRA variants allocate compute consistent with task complexity. In both cases, ANIRA-O uses fewer recurrent steps than ANIRA-E at the same task performance.
  • Figure 3: CLRS task complexity vs. accuracy (top) and compute allocation $\bar{d}$ (bottom). Green markers indicate input sizes seen during training. ANIRA compute allocation tracks task complexity. However, task accuracy drops sharply at input sizes not covered in the training set, indicating both interpolation and extrapolation failure.
  • Figure 4: MANO Training Dynamics: ANIRA learns tasks in easy-to-hard order, and learning happens in two phases: learning and compute reduction.
  • Figure 5: BREVO Training Dynamics: ANIRA learns tasks in easy-to-hard order and learning happens in two phases: learning and compute reduction.
  • ...and 5 more figures
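
The two decision modes in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal toy, not the ANIRA implementation: `shared_block` is a hypothetical scalar stand-in for the shared recurrent Transformer block, `near_fixed_point` is a hypothetical halting rule standing in for whatever learned halting signal the model uses, and the assumption that every token takes at least one step is ours. What the sketch does take from the caption is the freezing behavior: once a token has used its allocated steps (or has halted), later iterations leave its state unchanged.

```python
def shared_block(h):
    """Hypothetical stand-in for the shared recurrent block (scalar state)."""
    return 0.5 * h + 1.0


def run_early(tokens, depths, max_steps):
    """ANIRA-E style: per-token depths are fixed before the recurrence starts."""
    states = list(tokens)
    for step in range(max_steps):
        # Tokens whose budget is spent are frozen (identity mapping).
        states = [shared_block(h) if step < d else h
                  for h, d in zip(states, depths)]
    return states


def run_online(tokens, should_halt, max_steps):
    """ANIRA-O style: a halting decision is made per token after each step."""
    states = list(tokens)
    halted = [False] * len(states)
    used = [0] * len(states)  # recurrent steps actually applied per token
    for _ in range(max_steps):
        for i, h in enumerate(states):
            if halted[i]:
                continue  # frozen token: this iteration is an identity
            states[i] = shared_block(h)
            used[i] += 1
            halted[i] = should_halt(states[i])
    return states, used


def near_fixed_point(h):
    """Hypothetical halting rule: stop once the state is close to 2.0."""
    return abs(h - 2.0) < 0.3


tokens = [2.0, 0.0, -6.0]
online_states, used = run_online(tokens, near_fixed_point, max_steps=8)
early_states = run_early(tokens, depths=used, max_steps=8)
mean_depth = sum(used) / len(used)  # analogue of the mean allocation d-bar
print(used, online_states, mean_depth)
```

In this toy, feeding the depths found by online halting back into the early variant reproduces the same final states. The difference between the modes is when the depth is decided: before the recurrence from a shallow representation, or during it from the evolving state.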