CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo

TL;DR

CodeScaler is an execution-free reward model that scales both reinforcement learning training and test-time inference for code generation. At test time it performs comparably to unit-test-based approaches while reducing latency 10-fold.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
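
The abstract names syntax-aware code extraction and validity-preserving reward shaping but does not say how either is implemented. The following is a minimal sketch of one plausible reading, not the authors' method: pull the last fenced block that parses as Python out of a model response, and give a fixed floor reward to responses with no valid code so the reward model only scores well-formed programs. The names `extract_code`, `shaped_reward`, `score_fn`, and the `invalid_reward` value are all hypothetical.

```python
import ast
import re
from typing import Callable, Optional

# Build the fence marker programmatically so this example can itself sit in a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last fenced block that parses as Python, or None if there is none."""
    for block in reversed(FENCE_RE.findall(response)):
        try:
            ast.parse(block)
        except (SyntaxError, ValueError):
            continue  # not valid Python, try an earlier block
        return block
    return None


def shaped_reward(
    response: str,
    score_fn: Callable[[str], float],
    invalid_reward: float = -1.0,  # assumed floor value, not taken from the paper
) -> float:
    """Score valid code with the reward model; give invalid responses a fixed floor."""
    code = extract_code(response)
    if code is None:
        return invalid_reward
    return score_fn(code)
```

Here `score_fn` stands in for the reward model's forward pass. Gating on `ast.parse` is only a syntactic check, so a program that parses but is semantically wrong still reaches the scorer.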

Paper Structure (39 sections, 8 equations, 4 figures, 11 tables)

Figures (4)

  • Figure 1: Left: Training-Time Comparison shows that despite their larger data scale, synthetic code datasets exhibit a clear performance gap compared to verified code contest problems in RL training. While reward models provide dense supervision, they do not integrate effectively with RL, resulting in weaker performance than RLVR. Right: Test-Time Comparison illustrates that Unit Test TTS methods and off-the-shelf reward models demonstrate a clear performance–latency trade-off. This motivates us to develop a reward model that is both effective and efficient for RL training and test-time scaling.
  • Figure 2: Overall Pipeline of CodeScaler for training-time and test-time scaling, which provides execution-free rewards for policy optimization during RL training, and serves as a lightweight sampler for Best-of-N selection, without relying on test-case execution.
  • Figure 3: Comparison of Best-of-N (BoN@8) performance across five code generation benchmarks using different test-time scaling methods. CodeScaler consistently outperforms other reward models and achieves performance comparable to CURE.
  • Figure 4: Ablation study on CodeScaler components. Results compare RL performance using different reward model variants.
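
Figures 2 and 3 refer to Best-of-N selection with an execution-free reward model. As a rough illustration of that general technique, assuming the reward model is exposed as a function from a candidate program to a scalar score, selection reduces to an argmax over the sampled candidates. The names `best_of_n` and `score_fn` are hypothetical and not taken from the paper.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest score; ties go to the earliest candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # One scorer call per candidate, with no test-case execution involved.
    return max(candidates, key=score_fn)
```

With BoN@8 as in Figure 3, `candidates` would hold eight sampled solutions for one problem, and `score_fn` would wrap a single forward pass of the reward model per candidate.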