Dynamic Scaling of Unit Tests for Code Reward Modeling

Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang

TL;DR

The paper investigates how scaling the number of unit tests affects the quality of the reward signals used to identify correct code solutions from LLMs. It introduces CodeRM-8B, a compact unit test generator trained via a synthetic data pipeline, and a dynamic scaling mechanism that allocates computation based on problem difficulty to improve efficiency. Pilot experiments show that more unit tests generally yield better reward signals, especially for harder problems, and that dynamic scaling further improves performance under a fixed budget. Across three benchmarks and multiple policy/reward configurations, CodeRM-8B with dynamic scaling achieves substantial gains, including up to 18.4% on HumanEval Plus for a smaller model (Llama3-8B) and 3.42% for a proprietary model (GPT-4o-mini).

Abstract

Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of the unit tests serve as reward signals to identify correct solutions. Because LLMs confidently make mistakes, these unit tests are not reliable, which diminishes the quality of the reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our pilot experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed on more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests to problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).
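To make the reward signal concrete, here is a minimal sketch of how unit test execution results can be used to pick one solution out of several candidates. It assumes the reward of a solution is simply the number of generated unit tests it passes. The function name `select_best_solution` and the `pass_matrix` layout are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
from typing import List


def select_best_solution(pass_matrix: List[List[bool]]) -> int:
    """Return the index of the candidate that passes the most unit tests.

    pass_matrix[i][j] is True when candidate solution i passes unit test j.
    Assumption: the reward of a solution is its count of passed tests.
    """
    if not pass_matrix:
        raise ValueError("need at least one candidate solution")
    # One reward per candidate: how many unit tests it passed.
    rewards = [sum(row) for row in pass_matrix]
    # max() keeps the first maximal index, so ties go to the earliest candidate.
    return max(range(len(rewards)), key=rewards.__getitem__)


# Three candidates evaluated on three generated unit tests.
results = [
    [True, False, True],   # candidate 0 passes 2 tests
    [True, True, True],    # candidate 1 passes 3 tests
    [False, False, True],  # candidate 2 passes 1 test
]
best = select_best_solution(results)
```

Under this scheme a single unreliable unit test can mislead the selection, while adding more tests lets incorrect tests be outvoted, which is the intuition behind scaling the number of unit tests.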

Paper Structure

This paper contains 38 sections, 14 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Scaling the quantities of unit tests for majority voting leads to improvements in performance across different policy models and reward models. Policy refers to the model that produces code solutions, while reward denotes the model that generates unit tests.
  • Figure 2: The correlation between the quantities of unit tests and the performance on different unit test generators (reward model) with $200$ candidate code solutions.
  • Figure 3: The improvements of best-of-N performance on problems of different difficulties. Quintile 1 (easiest) has the highest pass rate, while Quintile 5 (hardest) has the lowest pass rate. Scaling the quantity of unit tests significantly improves the accuracy on more complex problems.
  • Figure 4: Overview for efficient and high-quality unit test scaling. First, we train a lightweight unit test generator based on high-quality synthetic data. Subsequently, we employ dynamic unit test scaling to further improve efficiency.
  • Figure 5: The performance of three different unit test generators (reward model) on different quantities of unit tests, while employing Llama3-8B as the policy model.
  • ...and 4 more figures
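The difficulty quintiles mentioned in the Figure 3 caption can be illustrated with a short sketch. It assumes each problem has a measured pass rate and that problems are split into five equal-sized groups after sorting from highest to lowest pass rate. The function name `split_into_quintiles` and the problem IDs are hypothetical, and the exact bucketing used in the paper may differ.

```python
from typing import Dict, List


def split_into_quintiles(pass_rates: Dict[str, float]) -> List[List[str]]:
    """Group problem IDs into five buckets by descending pass rate.

    Index 0 is Quintile 1 (easiest, highest pass rate);
    index 4 is Quintile 5 (hardest, lowest pass rate).
    """
    # Sort by pass rate, highest first; break ties by ID for determinism.
    ordered = sorted(pass_rates, key=lambda pid: (-pass_rates[pid], pid))
    n = len(ordered)
    # Integer arithmetic on the boundaries keeps bucket sizes within one of each other.
    return [ordered[i * n // 5:(i + 1) * n // 5] for i in range(5)]


# Ten hypothetical problems with pass rates 0.0, 0.1, ..., 0.9.
rates = {f"p{i}": i / 10 for i in range(10)}
quintiles = split_into_quintiles(rates)
```

Reporting accuracy per bucket then shows whether additional unit tests help more on the low-pass-rate groups, as the caption states.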