MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning

Hongwei Chen, Yishu Lei, Dan Zhang, Bo Ke, Danxiang Zhu, Xuyi Chen, Yuxiang Lu, Zhengjie Huang, Shikun Feng, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang

TL;DR

MatryoshkaThinking introduces a training-free, test-time reasoning framework that recursively samples, verifies, and summarizes multiple candidate solutions to improve reasoning under fixed compute budgets. By repeating the summarize–verify cycle, the method leverages the model’s intrinsic generative, discriminative, and summarization capabilities to reduce errors and raise Pass@1 with far less compute than prior approaches (e.g., achieving 99.79 on AIME2025 with 42M tokens, about 4% of DeepConf’s budget). It generalizes across text, vision-language, and audio-language benchmarks and across diverse model families, and ablations show that the loop, self-verification, and summarization each contribute meaningfully. The findings offer a scalable paradigm for efficient inference in large language models and open avenues for integrating external tools to further enhance reliability.
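
The sample, verify, and summarize loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `matryoshka_sketch`, the callables `generate`, `verify`, and `summarize`, the fallback used when no candidate passes verification, and the default `num_samples` and `num_loops` values are all assumptions. The toy stubs at the bottom stand in for real model calls.

```python
import itertools
from collections import Counter
from typing import Callable, List


def matryoshka_sketch(
    question: str,
    generate: Callable[[str], str],               # hypothetical: samples one solution
    verify: Callable[[str, str], bool],           # hypothetical: model judges a solution
    summarize: Callable[[str, List[str]], str],   # hypothetical: merges solutions into one
    num_samples: int = 4,
    num_loops: int = 2,
) -> str:
    """Recursive sample -> self-verify -> summarize loop (illustrative only)."""
    # Initial sampling of candidate solutions.
    candidates = [generate(question) for _ in range(num_samples)]

    for _ in range(num_loops):
        # Self-verify: keep the candidates the model itself accepts.
        kept = [c for c in candidates if verify(question, c)]
        if not kept:
            # Assumption: if nothing passes, summarize over all candidates.
            kept = candidates
        # Summarize: produce a fresh pool of candidates from the kept ones.
        candidates = [summarize(question, kept) for _ in range(num_samples)]

    # Final consolidation into a single answer.
    kept = [c for c in candidates if verify(question, c)] or candidates
    return summarize(question, kept)


# Toy stand-ins so the sketch runs without a model.
_toy_answers = itertools.cycle(["41", "42", "42", "40"])


def toy_generate(question: str) -> str:
    return next(_toy_answers)


def toy_verify(question: str, candidate: str) -> bool:
    return candidate == "42"


def toy_summarize(question: str, pool: List[str]) -> str:
    return Counter(pool).most_common(1)[0][0]


answer = matryoshka_sketch("What is 6 * 7?", toy_generate, toy_verify, toy_summarize)
```

In a real setting each callable would wrap a prompt to the same model, so the quality of the result depends on how reliable the model's own verification and summarization are.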

Abstract

Test-time scaling has emerged as a promising paradigm in language modeling, wherein additional computational resources are allocated during inference to enhance model performance. Recent approaches, such as DeepConf, have demonstrated the efficacy of this strategy; however, they often incur substantial computational overhead to achieve competitive results. In this work, we propose MatryoshkaThinking, a novel method that significantly reduces computational cost while maintaining state-of-the-art performance. Specifically, MatryoshkaThinking attains a score of 99.79 on AIME2025 using only 4% of the computation required by DeepConf. The core of our approach lies in the recursive exploitation of the model's intrinsic capabilities in reasoning, verification, and summarization, which collectively enhance the retention of correct solutions and reduce the disparity between Pass@k and Pass@1. Comprehensive evaluations across multiple open-source models and challenging multi-modal reasoning benchmarks validate the effectiveness and generality of our method. These findings offer new insights into the design of efficient and scalable test-time inference strategies for advanced language models.
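
To make the gap between Pass@k and Pass@1 concrete, the sketch below computes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones. The abstract does not say which estimator the paper uses, so treating this formula as the metric is an assumption, and the sample counts in the example are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from n samples is correct, given that c of the n samples are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 32 samples for one question, 8 of them correct.
p1 = pass_at_k(n=32, c=8, k=1)    # chance a single sample is correct
p32 = pass_at_k(n=32, c=8, k=32)  # chance at least one of all 32 is correct
```

With these made-up counts, `p1` is 0.25 while `p32` is 1.0: a correct answer exists among the samples, but a single draw usually misses it. Closing that gap is what the abstract means by reducing the disparity between Pass@k and Pass@1.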

Paper Structure

This paper contains 24 sections, 15 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Performance–efficiency trade-off on AIME2025, showing pass@1 against inference token usage for different test-time scaling strategies based on GPT-OSS-120B. Efficient Reasoning via MatryoshkaThinking lies in the upper-left region, reflecting higher performance at lower inference cost. Each query is sampled 32 times.
  • Figure 2: MatryoshkaThinking significantly enhances the Pass@1 accuracy of ERNIE-4.5-21B-A3B-Thinking on AIME2025, with performance converging toward Pass@32 as the number of recursive loops increases.
  • Figure 3: Overview of our MatryoshkaThinking. As shown in the figure, MatryoshkaThinking consists of three key components: self-verify, summary, and the iterative loop between them. With test-time scaling, recursive self-verify and summary identify promising solutions, ultimately consolidating them into a robust final solution.
  • Figure 4: The figure illustrates the accuracy of various models when tasked with summarizing context without any correct answers. The results reveal that stronger models like GPT-OSS-120B (achieving 84.61% accuracy) are able to correctly identify answers based on reasoning and error correction, while weaker models such as Qwen3-4B-Instruct-2507 (scoring 7.69%) show limited or no inductive reasoning ability. These findings highlight the critical role of the Summary phase in enhancing model performance through inductive reasoning and error correction, especially for stronger models.
  • Figure 5: The figure demonstrates that, as the proportion of correct answers in the summary context increases, the performance of most models improves. In particular, for stronger models like GPT-OSS-120B, the summary method outperforms majority voting, especially when the proportion of correct answers in the context is small. This ratio of correct answers simulates the model’s self-verify capability in real-world applications, highlighting that the model's verification ability is crucial for improving performance.
  • ...and 3 more figures
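
The majority-voting baseline mentioned in the Figure 5 caption can be sketched as below, to show why a small proportion of correct answers in the context is the hard case. The answer strings and the choice of "113" as the correct answer are invented for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the candidates."""
    return Counter(answers).most_common(1)[0][0]


# Illustrative context: assume "113" is correct but appears in only 2 of 8 candidates.
context = ["204", "204", "204", "204", "113", "113", "70", "9"]
correct = "113"

voted = majority_vote(context)
correct_ratio = context.count(correct) / len(context)
```

Here `correct_ratio` is 0.25 and `voted` is "204", so voting returns the wrong answer whenever incorrect candidates agree with each other more often than correct ones do. A summary step that reasons over the candidates, rather than counting them, is not bound by that frequency constraint, which is consistent with the caption's observation for stronger models.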

Theorems & Definitions (3)

  • Definition 1: Probability of Correct Solution
  • Definition 2: Verification Process
  • Definition 3: Summarization Process