Understanding Dynamic Compute Allocation in Recurrent Transformers

Ibraheem Muhammad Moosa; Suhas Lohit; Ye Wang; Moitreya Chatterjee; Wenpeng Yin

TL;DR

This work studies token-level adaptive computation by introducing a complexity-controlled evaluation paradigm and ANIRA, a unified recurrent Transformer that supports per-token variable-depth computation. It distinguishes two decision mechanisms, ANIRA-E (early depth allocation) and ANIRA-O (online halting), and analyzes the compute policies each one learns on algorithmic and synthetic language tasks. The results show that compute allocation can align with task complexity without explicit difficulty supervision, but this alignment does not guarantee extrapolation to unseen input sizes. The two decision modes also develop qualitatively different strategies: early allocation relies on static structural cues, whereas online halting more closely tracks algorithmic execution state. The study further reveals a two-phase training dynamic (learning followed by compute reduction) and discusses the interpretability implications of the different allocation policies, suggesting directions for complexity-aware benchmarking and for inductive biases that favor algorithmic structure in adaptive computation systems.

Abstract

Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.
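To make the two decision timings concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy stand-in for the shared recurrent block (`toy_block`) and hypothetical decision functions (`depth_fn`, `halt_fn`); none of these names come from the paper. The only behavior taken from the source is that early allocation fixes each token's depth before recurrence, online halting decides after each recurrent step, and a token's representation is frozen once it stops.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def toy_block(h: Vector) -> Vector:
    """Hypothetical stand-in for one pass through the shared recurrent block."""
    return [math.tanh(0.5 * x + 0.1) for x in h]


def run_early(
    tokens: Sequence[Vector],
    depth_fn: Callable[[Vector], int],
    max_depth: int,
) -> Tuple[List[Vector], List[int]]:
    """Early allocation: choose every token's depth once, before recurrence."""
    # Depths are computed from the pre-recurrence representation and clamped.
    depths = [min(max_depth, max(1, depth_fn(h))) for h in tokens]
    states = [list(h) for h in tokens]
    for step in range(max_depth):
        for i, d in enumerate(depths):
            if step < d:
                states[i] = toy_block(states[i])
            # Otherwise the token is frozen: this step is an identity mapping.
    return states, depths


def run_online(
    tokens: Sequence[Vector],
    halt_fn: Callable[[Vector, int], bool],
    max_depth: int,
) -> Tuple[List[Vector], List[int]]:
    """Online halting: after each step, each active token decides whether to stop."""
    states = [list(h) for h in tokens]
    active = [True] * len(states)
    depths = [0] * len(states)
    for _ in range(max_depth):
        for i in range(len(states)):
            if not active[i]:
                continue  # frozen token, representation passes through unchanged
            states[i] = toy_block(states[i])
            depths[i] += 1
            # The decision sees the current state, not only the input.
            if halt_fn(states[i], depths[i]):
                active[i] = False
    return states, depths


def mean_depth(depths: Sequence[int]) -> float:
    """Average number of recurrent steps used per token."""
    return sum(depths) / len(depths)


if __name__ == "__main__":
    toks = [[0.0], [1.0], [2.0]]
    _, early = run_early(toks, lambda h: 1 + int(abs(h[0])), max_depth=4)
    _, online = run_online(toks, lambda h, d: abs(h[0]) < 0.3, max_depth=4)
    print("early depths:", early, "mean:", mean_depth(early))
    print("online depths:", online, "mean:", mean_depth(online))
```

The structural difference is what the decision function can see: `depth_fn` receives only the pre-recurrence representation, while `halt_fn` receives the state after each step. That is one way to read the finding that early decisions rely on static structural cues whereas online halting tracks execution state.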

Paper Structure (42 sections, 8 equations, 10 figures, 10 tables)

Introduction
Related work
Adaptive compute recurrent transformers
ANIRA Architecture
Training and Inference
Training Objective
Passthrough Mechanism
Depth Selection at Training and Inference
Allocation-Aware KV Caching
Compute and Memory Savings During Inference
Difference in compute allocation policies between ANIRA-E and ANIRA-O
Activity pattern and execution cost
Complexity Controlled Evaluation Protocol
Algorithmic Tasks
Synthetic Language
...and 27 more sections

Figures (10)

Figure 1: Two variants of ANIRA: (a) ANIRA-E: depth allocation decided from a shallow (pre-recurrence) representation. (b) ANIRA-O: online halting decisions are made for each token between each recurrent layer to decide whether to continue or halt adaptive compute. Once a token has completed its allocated number of recurrent steps, its representation is frozen and subsequent iterations act as identity mappings for that token.
Figure 2: Task complexity vs mean depth allocation. Both ANIRA variants allocate compute consistent with task complexity. In both cases, ANIRA-O chooses fewer recurrent steps than ANIRA-E for the same task performance.
Figure 3: CLRS task complexity vs accuracy (top) and compute allocation $\bar{d}$ (bottom). Green markers indicate input sizes seen during training. ANIRA compute allocation tracks task complexity. However, task accuracy drops sharply at input sizes not covered in the training set, indicating interpolation and extrapolation failure.
Figure 4: MANO Training Dynamics: ANIRA learns tasks in easy-to-hard order and learning happens in two phases: learning and compute reduction.
Figure 5: BREVO Training Dynamics: ANIRA learns tasks in easy-to-hard order and learning happens in two phases: learning and compute reduction.
...and 5 more figures
