M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, Tri Dao

TL;DR

The paper tackles the problem of scaling test-time reasoning over long sequences by replacing Transformer-based inference with M1, a memory-efficient hybrid linear RNN built on the Mamba architecture. It presents a three-stage pipeline (distillation from a Transformer to Mamba, math-focused supervised fine-tuning, and reinforcement-learning-based reasoning enhancement with GRPO) to transfer reasoning capabilities while maintaining efficient inference. Empirical results on MATH500, AIME, and OlympiadBench show that M1 matches or approaches state-of-the-art reasoning models at a similar scale and achieves a more than 3x throughput speedup over same-size Transformer baselines served with vLLM, especially at large batch sizes. The work demonstrates that scalable test-time generation, through self-consistency voting and longer chains of thought, is feasible with subquadratic architectures and cross-architecture distillation, enabling practical reasoning at scale.
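To make the memory argument concrete, here is a rough sketch of why long generations are costly for attention layers: the key/value cache grows linearly with sequence length, while a recurrent layer keeps a fixed-size state. All configuration values below (layer count, head count, state size) are hypothetical placeholders chosen for illustration and are not taken from the paper; M1 is a hybrid model, so its real footprint would sit somewhere between the two cases.

```python
def kv_cache_bytes(seq_len, batch_size, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for attention layers.

    The leading factor 2 accounts for storing both K and V.
    The result grows linearly with seq_len.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def recurrent_state_bytes(batch_size, num_layers, state_elems_per_layer,
                          bytes_per_elem=2):
    """Bytes needed for a fixed-size recurrent state.

    The result does not depend on how many tokens have been generated.
    """
    return num_layers * state_elems_per_layer * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration, NOT the paper's model dimensions.
    cfg = dict(batch_size=128, num_layers=28, num_kv_heads=8, head_dim=128)
    for seq_len in (1024, 4096, 16384):
        gib = kv_cache_bytes(seq_len, **cfg) / 2**30
        print(f"seq_len={seq_len:>6}: KV cache ~ {gib:.1f} GiB")

    # Hypothetical per-layer state size; constant for any sequence length.
    state = recurrent_state_bytes(batch_size=128, num_layers=28,
                                  state_elems_per_layer=4096 * 16)
    print(f"recurrent state ~ {state / 2**30:.2f} GiB at any length")
```

Under these assumptions the attention cache becomes the dominant cost as batch size and decoding length grow, which is the regime where the summary reports the largest speedups.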

Abstract

Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, Transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general-purpose inference engine, vLLM, and observe more than a 3x speedup compared to a same-size Transformer. With this throughput speedup, we are able to achieve higher accuracy than DeepSeek R1 distilled Transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.
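Self-consistency voting under a fixed time budget can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: `majority_vote` and `samples_within_budget` are hypothetical helper names, answers are assumed to be already extracted as strings, and the budget model assumes a constant time per sample, which ignores batching effects.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none.

    Ties are resolved in favor of the answer that appeared first.
    """
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def samples_within_budget(time_budget_s, seconds_per_sample):
    """Number of complete samples that fit in a fixed time budget.

    Simplifying assumption: each sample takes the same amount of time.
    """
    return int(time_budget_s // seconds_per_sample)


if __name__ == "__main__":
    # Made-up sampled final answers for a single problem.
    sampled = ["42", "41", "42", " 42 ", "", "17"]
    print(majority_vote(sampled))

    # With a fixed budget, a model that generates faster gets more votes.
    budget_s = 60.0
    slow = samples_within_budget(budget_s, seconds_per_sample=15.0)
    fast = samples_within_budget(budget_s, seconds_per_sample=5.0)
    print(slow, fast)
```

The point of the second helper is only that higher throughput converts directly into more samples per unit time, and more samples give the vote more chances to settle on the correct answer.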

Paper Structure

This paper contains 23 sections, 2 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Inference latency when using prompt length 256 and decoding length 4096.
  • Figure 2: Inference latency when using batch size 128.
  • Figure 3: Number of samples vs. AIME25 accuracy (left) and generation time (seconds) vs. AIME25 accuracy (right). Both graphs include pass@1 and majority voting accuracies for M1 and DeepSeek-R1-Distill-Qwen-1.5B.
  • Figure 4: Generation length vs. AIME25 accuracy (left) and generation time (seconds) vs. AIME25 accuracy (right). Sampling for both models is done using a temperature of 0.8.
  • Figure 5: Pass@1 vs. maximum sequence length in GRPO training.