Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models

Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, Moksh Jain

TL;DR

This paper introduces Recursive Self-Aggregation (RSA), a hybrid test-time scaling framework that treats reasoning as an evolutionary process by maintaining a population of candidate solutions and recursively aggregating subsets to produce improved reasoning chains. RSA blends parallel exploration with sequential refinement, enabling deeper thinking without external verifiers and leveraging reasoning traces rather than solely final answers. The authors show RSA delivers substantial Pass@1 gains across math, code, reasoning, and knowledge tasks and across diverse model families, bridging the gap between small instruction-tuned models and larger reasoning models. They further demonstrate aggregation-aware reinforcement learning to train models to aggregate solutions, yielding additional performance gains. The work provides practical guidance for deploying deeper test-time thinking under compute budgets and suggests promising directions for future improvements such as explicit fitness functions and multi-step aggregation policies.

Abstract

Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. RSA exploits the rich information embedded in the reasoning chains -- not just the final answers -- and enables bootstrapping from partially correct intermediate steps within different chains of thought. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), while outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further demonstrate that training the model to combine solutions via a novel aggregation-aware reinforcement learning approach yields significant performance gains. Code available at https://github.com/HyperPotatoNeo/RSA.
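
The loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (the linked repository is the reference for that): `generate` is a hypothetical callable standing in for any LLM sampling call, `build_aggregation_prompt` and its wording are invented for illustration, and the default values of `k` and `t` are arbitrary. The symbols N, K, and T follow the Figure 3 caption.

```python
import random
from typing import Callable, List


def build_aggregation_prompt(question: str, candidates: List[str]) -> str:
    """Hypothetical aggregation prompt; the paper's actual wording may differ."""
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    return (
        f"Problem:\n{question}\n\n{numbered}\n\n"
        "Combine the correct steps from the candidates above "
        "into one improved solution."
    )


def rsa(
    question: str,
    generate: Callable[[str], str],  # assumption: prompt in, completion out
    n: int = 16,   # population size N
    k: int = 4,    # aggregation subset size K (arbitrary default)
    t: int = 10,   # number of aggregation steps T (arbitrary default)
    seed: int = 0,
) -> List[str]:
    """Return the population of N candidate solutions after T aggregation steps."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    rng = random.Random(seed)

    # Parallel phase: N independent candidate reasoning chains.
    population = [generate(question) for _ in range(n)]

    # Sequential phase: each step builds a fresh population of N solutions,
    # each produced by aggregating K distinct members of the previous one.
    for _ in range(t):
        population = [
            generate(build_aggregation_prompt(question, rng.sample(population, k)))
            for _ in range(n)
        ]
    return population
```

With any `generate` function that maps a prompt string to a completion string, `rsa(question, generate, n=16, k=4, t=10)` returns the final population after `n * (t + 1)` model calls. How a single answer is picked from that population is left to the caller in this sketch.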

Paper Structure

This paper contains 49 sections, 21 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Recursive Self-Aggregation (RSA, \ref{sec:rsa}) substantially improves Pass@$1$ across tasks and model architectures. RSA enables the much smaller Qwen3-4B-Instruct-2507 to match the performance of larger reasoning models such as DeepSeek-R1 and o3-mini (high). These gains are further amplified through our proposed aggregation-aware RL framework (\ref{sec:rl}).
  • Figure 2: Overview of test-time scaling control flows. Parallel methods generate multiple candidates and select among them using a verification mechanism. Sequential methods iteratively refine a chain, correcting previous mistakes. Hybrid methods combine parallel branching with sequential refinement.
  • Figure 3: RSA generates a population of $N$ solutions for a given prompt and recursively updates them over $T$ steps. Each update step subsamples $K$ distinct solutions from the current population and generates an improved solution with the aggregation prompt. See \ref{app:algorithm} for algorithm pseudo-code.
  • Figure 4: RSA significantly improves Pass@$1$ across math, code, general reasoning, and knowledge recall tasks. We observe consistent gains across diverse model families, including standard instruction-tuned models and long CoT "thinking" models. Further details are provided in \ref{app:models_used}.
  • Figure 5: Pass@$1$ vs. RSA steps, for fixed population size $N=16$, using Qwen3-4B-Instruct-2507. Error bands indicate standard deviation over 4 seeds. Larger $K$ generally improves performance.
  • ...and 4 more figures