Tiny Recursive Reasoning with Mamba-2 Attention Hybrid

Wenlong Wang, Fergal Reid

TL;DR

This work asks whether Mamba-2 state-space hybrids can serve as the operator inside the Tiny Recursive Model (TRM) without sacrificing its reasoning ability. Replacing the Transformer blocks with Mamba-2 hybrid operators in a parameter-matched TRM improves results on ARC-AGI-1: +2.0 percentage points at pass@2 (45.88% vs 43.88%) and +4–5 percentage points at higher K, with pass@1 at parity. On Sudoku-Extreme and Maze, the results show task-dependent strengths: dense cross-position mixing excels on small grids, while sequential Mamba-2 processing contributes candidate diversity on larger ones. The study shows that Mamba-2 hybrids can preserve latent recursive reasoning and expand candidate coverage, which supports SSM-based operators as viable choices for recursive models and motivates further work on internalizing recursion into the inner SSM state updates. The results also highlight the role of post-norm stability in recursive computation and point toward better operator mixing strategies for latent recursion.

Abstract

Recent work on recursive reasoning models like TRM demonstrates that tiny networks (7M parameters) can achieve strong performance on abstract reasoning tasks through latent recursion -- iterative refinement in hidden representation space without emitting intermediate tokens. This raises a natural question about operator choice: Mamba-2's state space recurrence is itself a form of iterative refinement, making it a natural candidate for recursive reasoning -- but does introducing Mamba-2 into the recursive scaffold preserve reasoning capability? We investigate this by replacing the Transformer blocks in TRM with Mamba-2 hybrid operators while maintaining parameter parity (6.83M vs 6.86M parameters). On ARC-AGI-1, we find that the hybrid improves pass@2 (the official metric) by +2.0 percentage points (45.88% vs 43.88%) and consistently outperforms at higher K values (+4.75 percentage points at pass@100), whilst maintaining pass@1 parity. This suggests improved candidate coverage -- a correct solution appears among the model's candidates more often -- with similar top-1 selection. Our results validate that Mamba-2 hybrid operators preserve reasoning capability within the recursive scaffold, establishing SSM-based operators as viable candidates in the recursive operator design space and taking a first step towards understanding the best mixing strategies for recursive reasoning.
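
To make the coverage versus selection distinction concrete, here is a minimal sketch of a pass@K computation. It assumes that each puzzle yields a pool of candidate predictions, that candidates are ranked by how many times they were predicted, and that a puzzle counts as solved at K if the target is among the K top-ranked candidates. The text above does not spell out the ranking procedure, so the helper names (`rank_by_votes`, `pass_at_k`) and the toy data are hypothetical.

```python
from collections import Counter


def rank_by_votes(predictions):
    """Return the distinct predictions ordered by descending vote count."""
    counts = Counter(predictions)
    return [candidate for candidate, _ in counts.most_common()]


def pass_at_k(puzzles, k):
    """Fraction of puzzles whose target is among the top-k voted candidates."""
    solved = 0
    for predictions, target in puzzles:
        if target in rank_by_votes(predictions)[:k]:
            solved += 1
    return solved / len(puzzles)


# Toy data: strings stand in for output grids. Real grids would need a
# hashable form (for example a tuple of tuples) to be counted this way.
puzzles = [
    (["A", "A", "B"], "A"),            # target is the top-voted candidate
    (["C", "C", "D"], "D"),            # target is only the second candidate
    (["E", "F", "F"], "G"),            # target was never generated
    (["H", "I", "I", "H", "H"], "H"),  # target is the top-voted candidate
]

print(pass_at_k(puzzles, 1))  # 0.5
print(pass_at_k(puzzles, 2))  # 0.75
```

In this toy example the second puzzle shows the pattern the abstract describes: the correct answer is present in the candidate pool (coverage) but is not ranked first (selection), so it contributes to pass@2 and not to pass@1.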

Paper Structure (18 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Architecture comparison: (a) TR-mamba2attn with Mamba-2 hybrid operator and (b) TRM-attn with attention-based operator. Both use post-norm residual connections (norm&add) between components.
  • Figure 2: Training curves for ARC-AGI-1 across all pass@K metrics. The hybrid (TR-mamba2attn, orange) consistently outperforms the baseline (TRM-attn, blue) at pass@2 (the official metric) and higher K values throughout training, whilst maintaining pass@1 parity. The gap emerges early and remains stable, demonstrating that improved candidate coverage is a consistent property rather than a late-training phenomenon.
  • Figure 3: Evaluation statistics on ARC-AGI-1 comparing TR-mamba2attn (hybrid) and TRM-attn (baseline). The hybrid generates more unique candidates per puzzle and exhibits higher vote entropy (indicating diverse exploration), whilst TRM-attn shows higher vote concentration on the top-1 candidate and larger top-1 margin (indicating decisive selection). These statistics provide quantitative evidence for the coverage vs selection trade-off observed in pass@K curves.
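
The statistics named in the Figure 3 caption can be sketched as follows. The exact definitions are not given in the captions, so this is an assumption: vote entropy is taken as the Shannon entropy (in bits) of the vote shares, vote concentration as the share of votes on the top-1 candidate, and top-1 margin as the gap between the top-1 and top-2 shares. The function name `vote_statistics` and the example vote lists are hypothetical.

```python
import math
from collections import Counter


def vote_statistics(predictions):
    """Summarize how votes are spread over the candidates for one puzzle."""
    counts = Counter(predictions)
    total = sum(counts.values())
    # Vote shares, largest first.
    shares = sorted((c / total for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log2(p) for p in shares)
    top1 = shares[0]
    top2 = shares[1] if len(shares) > 1 else 0.0
    return {
        "unique_candidates": len(counts),
        "vote_entropy_bits": entropy,
        "top1_share": top1,
        "top1_margin": top1 - top2,
    }


# A decisive vote (most predictions agree) versus a diverse vote.
decisive = vote_statistics(["A"] * 8 + ["B"] * 2)
diverse = vote_statistics(["A", "A", "B", "B", "C", "C", "D", "D"])

print(decisive)
print(diverse)
```

Under these assumed definitions, the diverse vote has more unique candidates and higher entropy, while the decisive vote has a higher top-1 share and a larger margin, which mirrors the contrast the caption draws between the hybrid and the baseline.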