Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming

Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding, Ran Xin, Xia Xiao

Abstract

We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
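
The log-linear trend can be written compactly. As a minimal sketch, with an intercept $a$ and slope $b$ fit per training run (illustrative symbols, not values reported here):

```latex
% Hedged sketch of the observed trend: validation accuracy as a function
% of the average reasoning-token count \bar{T} across RL checkpoints.
% Verification RL warmup raises the intercept a; randomized clipping
% steepens the slope b in the observed regime.
\mathrm{Acc}(\bar{T}) \;\approx\; a + b \,\log \bar{T}
```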

Paper Structure

This paper contains 15 sections, 7 equations, and 8 figures.

Figures (8)

  • Figure 1: Log-linear trend: validation accuracy scales linearly with the logarithm of the average number of reasoning tokens during RL training. Each point corresponds to a successive RL training checkpoint.
  • Figure 2: Randomized clipping replaces the hard reward cliff (left) with a smooth ramp (middle), producing a steeper log-linear scaling curve (right); a code sketch of one possible implementation follows this list.
  • Figure 3: Training pipeline: SFT cold start with generation and verification trajectories, followed by verification RL and generation RL.
  • Figure 4: Verification RL warmup. Left: recall remains high while precision and accuracy improve during verification RL. Right: initializing generation RL from the verification checkpoint shifts the log-linear scaling curve upward.
  • Figure 5: The parallel thinking inference pipeline. The system spawns $N$ independent threads. Each thread generates a candidate solution, then verifies it via $V$ independent sampling calls, each producing a correctness judgment and reasoning. If all $V$ verdicts unanimously deem the solution correct, the thread terminates early; otherwise, one negative verdict is randomly selected and the model refines the solution conditioned on the previous attempt and the selected reasoning. This verify-refine loop repeats for up to $M$ rounds. After all threads complete, all solutions across threads and rounds are ranked by verification vote count, and the highest-scoring one is returned. A pseudocode sketch of this control flow follows this list.
  • ...and 3 more figures
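
The randomized clipping scheme of Figure 2 can be made concrete with a short sketch. This is a minimal illustration, assuming the clipping acts on response length with the reward zeroed past a token budget; the function names and the uniform threshold interval are assumptions, not details from the paper:

```python
import random

def hard_clip_reward(base_reward: float, n_tokens: int, budget: int) -> float:
    # Hard cliff (Figure 2, left): any response longer than the budget
    # receives zero reward, regardless of correctness.
    return base_reward if n_tokens <= budget else 0.0

def randomized_clip_reward(base_reward: float, n_tokens: int,
                           lo: int, hi: int) -> float:
    # Randomized threshold (assumed mechanism): sample a budget uniformly
    # in [lo, hi] for each rollout. In expectation the reward then ramps
    # linearly from base_reward (at lo tokens) to 0 (at hi tokens),
    # replacing the cliff with the smooth ramp of Figure 2 (middle).
    threshold = random.randint(lo, hi)
    return base_reward if n_tokens <= threshold else 0.0
```

In expectation, the sampled threshold turns the step at a fixed budget into a linear ramp over [lo, hi], which is the cliff-to-ramp contrast the caption describes.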
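The Figure 5 control flow can likewise be summarized in code. In this minimal sketch, generate, verify, and refine are hypothetical placeholders for the model's three behaviors, and the default V = 3 is an assumed value (the caption leaves $V$ unspecified); only the thread/round structure, the unanimous early exit, the random choice of a negative verdict, and the vote-count ranking come from the caption:

```python
import random
from typing import Callable, List, Tuple

def parallel_thinking(
    problem: str,
    generate: Callable[[str], str],                  # problem -> candidate solution
    verify: Callable[[str, str], Tuple[bool, str]],  # (problem, solution) -> (verdict, reasoning)
    refine: Callable[[str, str, str], str],          # (problem, solution, critique) -> new solution
    N: int = 16,   # independent threads
    M: int = 16,   # verify-refine rounds per thread
    V: int = 3,    # verification calls per round (assumed value)
) -> str:
    # (vote count, solution) for every candidate seen across threads and rounds.
    scored: List[Tuple[int, str]] = []
    for _ in range(N):
        solution = generate(problem)
        for _ in range(M):
            verdicts = [verify(problem, solution) for _ in range(V)]
            votes = sum(ok for ok, _ in verdicts)
            scored.append((votes, solution))
            if votes == V:
                break  # unanimous "correct": terminate this thread early
            # Otherwise pick one negative verdict at random and refine the
            # solution conditioned on the previous attempt and its reasoning.
            _, critique = random.choice([v for v in verdicts if not v[0]])
            solution = refine(problem, solution, critique)
    # Rank all candidates by verification vote count; return the best.
    return max(scored, key=lambda pair: pair[0])[1]
```

With N = M = 16, this sketch matches the 16-thread, 16-round configuration reported in the abstract.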