Extending Test-Time Scaling: A 3D Perspective with Context, Batch, and Turn

Chao Yu, Qixin Tan, Jiaxuan Gao, Shi Yu, Hong Lu, Xinting Yang, Zelai Xu, Yu Wang, Yi Wu, Eugene Vinitsky

TL;DR

The paper addresses the limits of test-time reasoning by proposing a unified 3D scaling framework that jointly scales context length, batch sampling, and turn-based refinement. By combining longer reasoning contexts, batch aggregation, and iterative self-improvement, the approach extends reasoning capacity beyond the base model's context window, delivering gold-level results on IMO/CPHO and strong performance on IOI, with further gains when human feedback is incorporated. Across math, physics, coding, and embodied robotics benchmarks, 3D scaling consistently surpasses single-dimension baselines, and the human-in-the-loop variant yields the strongest results and enables open-ended embodied learning. These findings highlight the practical potential of multi-dimensional test-time scaling for reasoning, code generation, and humanoid control, while also pointing to biases in aggregation and the need to explore additional scaling axes.

Abstract

Reasoning reinforcement learning (RL) has recently revealed a new scaling effect: test-time scaling. Thinking models such as R1 and o1 improve their reasoning accuracy at test time as the length of the reasoning context increases. However, compared with training-time scaling, test-time scaling is fundamentally constrained by the context length of base models, which remains orders of magnitude smaller than the number of tokens consumed during training. We revisit test-time enhancement techniques through the lens of scaling effects and introduce a unified framework of multi-dimensional test-time scaling to extend the capacity of test-time reasoning. Beyond conventional context-length scaling, we consider two additional dimensions: batch scaling, where accuracy improves with parallel sampling, and turn scaling, where iterative self-refinement enhances reasoning quality. Building on this perspective, we propose 3D test-time scaling, which integrates context, batch, and turn scaling. We show that: (1) each dimension demonstrates a test-time scaling effect, but with a bounded capacity; (2) combining all three dimensions substantially improves reasoning performance on challenging testbeds, including IOI, IMO, and CPHO, and further benefits from human preference feedback; and (3) the human-in-the-loop framework naturally extends to a more open-ended domain, i.e., embodied learning, which enables the design of humanoid control behaviors.
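To make the three dimensions concrete, here is a minimal sketch of how context, batch, and turn scaling could be composed in one loop. It is an illustration under stated assumptions, not the paper's algorithm: the function name `three_d_scaling`, the `generate` and `judge` callables, and the choice to carry the previous best draft into the next turn are all hypothetical and do not come from the source.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not from the paper):
#   generate(question, previous_draft, max_tokens) -> one candidate answer
#   judge(question, candidates) -> index of the preferred candidate
Generate = Callable[[str, Optional[str], int], str]
Judge = Callable[[str, List[str]], int]


def three_d_scaling(
    question: str,
    generate: Generate,
    judge: Judge,
    context_budget: int,  # context dimension: token budget per sample
    batch_size: int,      # batch dimension: parallel samples per turn
    num_turns: int,       # turn dimension: rounds of refinement
) -> Optional[str]:
    """Toy composition of context, batch, and turn scaling."""
    best: Optional[str] = None
    for _ in range(num_turns):
        # Batch scaling: draw several candidates, each limited to the
        # per-sample context budget and conditioned on the previous best.
        candidates = [
            generate(question, best, context_budget)
            for _ in range(batch_size)
        ]
        # Keep the previous best in the pool so the judge can retain it.
        if best is not None:
            candidates.append(best)
        # Aggregation: an LLM or human judge picks one candidate.
        best = candidates[judge(question, candidates)]
    return best
```

In this sketch the number of generation calls is `batch_size * num_turns`, each bounded by `context_budget` tokens, which is one simple way to account for a total thinking budget. The `judge` callable can stand in for either an LLM judge or a human preference signal.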

Paper Structure

This paper contains 53 sections, 1 theorem, 16 equations, 8 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Consider an LLM policy $\pi_\theta$, an input question $x\in\mathcal{X}$, and a unique ground-truth answer $a^*$. If some incorrect answer $\tilde{a}$ has a strictly higher probability of being produced by the LLM than $a^*$, then the accuracy of majority voting approaches zero as the batch size grows.
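The effect described by the theorem can be checked numerically on a toy answer distribution. The sketch below computes the exact probability that the correct answer wins a majority vote when each sample is drawn independently. The setup is an assumption made for illustration: three possible answers (the correct one, one distractor, and a lumped "other" answer), with ties counted as failures. The function name and the example probabilities are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, p_distractor: float,
                           batch_size: int) -> float:
    """Exact probability that the correct answer gets strictly more votes
    than both the distractor and the lumped 'other' answer.

    Assumes independent samples from a fixed three-way answer distribution;
    ties are counted as failures.
    """
    p_other = max(0.0, 1.0 - p_correct - p_distractor)
    accuracy = 0.0
    for c in range(batch_size + 1):            # votes for the correct answer
        for d in range(batch_size - c + 1):    # votes for the distractor
            o = batch_size - c - d             # votes for 'other'
            if c > d and c > o:
                # Multinomial coefficient B! / (c! d! o!)
                ways = comb(batch_size, c) * comb(batch_size - c, d)
                accuracy += ways * p_correct**c * p_distractor**d * p_other**o
    return accuracy


if __name__ == "__main__":
    # Distractor (0.45) is more likely than the correct answer (0.35).
    for b in (1, 5, 25, 101):
        print(b, majority_vote_accuracy(0.35, 0.45, b))
```

With a single sample the accuracy equals the probability of the correct answer; as the batch grows, the vote concentrates on the more probable distractor, so the accuracy shrinks toward zero.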

Figures (8)

  • Figure 1: Illustration of Test-time Scaling across three dimensions: context, batch, and turn.
  • Figure 2: The average accuracy over the IMO2025 dataset as a function of the total thinking budget for individual scaling on three dimensions: context, batch and turn. All three scaling methods achieve substantial improvements at small scales but saturate as the scale becomes larger.
  • Figure 3: Batch scaling analysis on IMO problems. (a) shows the average accuracy of batch scaling (majority vote) on each IMO problem with different batch sizes. (b) illustrates the failure mode of majority vote observed in IMO3. The model produces both the correct answer "4" and the incorrect answer "2". As batch size $B$ increases, the probability of selecting the distractor "2" grows due to model bias.
  • Figure 4: The average accuracy over the IMO2025 dataset as a function of the total thinking budget for individual scaling and 3D Scaling with different batch sizes. 3D Scaling achieves performance beyond the limits of individual scaling, reaching 73.3%. The red marker denotes 3D Scaling with a human judge, which attains 86.7% accuracy, highlighting the effectiveness of human feedback.
  • Figure 5: Comprehensive comparison of different test-time scaling methods across four domains: Math Olympics (IMO2025), Physics Olympics (CPHO2022), Coding (IOI2025), and Embodied (IsaacGym). Each dimension in the radar charts represents a single task or problem and is normalized by the best-performing method on that specific dimension. 3D Scaling with a human judge consistently outperforms baseline methods, including context scaling, turn scaling, and batch scaling, across different benchmarks. 3D Scaling with an LLM judge also achieves competitive results on the IMO 2025 and CPHO 2022 benchmarks, but performs worse than 3D Scaling with a human judge on the challenging programming task. Results for IMO6 and CPHO4 are excluded due to zero accuracy across all methods. Because IMO2 is a pure proof problem, a Gemini vote over proof processes is not possible, so the Batch Scaling (Vote) experiment was not run for IMO2.
  • ...and 3 more figures

Theorems & Definitions (3)

  • Theorem 1: Systematic Bias Amplification under Majority Voting
  • proof
  • proof: proof of Theorem 1