CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning

Huimu Yu, Xing Wu, Haotian Xu, Debing Zhang, Songlin Hu

TL;DR

CodePMP tackles the data bottleneck in reward-model finetuning by pretraining a preference model on millions of synthesized code-preference pairs derived from public GitHub code. This approach improves RM sample efficiency and downstream reasoning performance across mathematical and logical benchmarks (GSM8K, MATH, ReClor, LogiQA2.0), with robust cross-architecture generalization. By jointly training RM and LM components and leveraging large-scale code-derived data, CodePMP reduces reliance on expensive human annotations while enhancing Best-of-N selection and overall reasoning ability. The work demonstrates the practical impact of scalable, code-based PMP for broad LLM reasoning tasks and suggests promising directions for future refinements and broader RM applications.

Abstract

Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.
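
To make the pretraining objective concrete, here is a minimal sketch of a pairwise preference loss over chosen/rejected pairs. It assumes the common Bradley-Terry logistic form, which this summary does not confirm as the paper's exact objective; the function names and the scalar scores are illustrative, and the language-modeling term mentioned in the TL;DR is omitted.

```python
import math


def pairwise_preference_loss(chosen_score: float, rejected_score: float) -> float:
    """Return -log(sigmoid(chosen_score - rejected_score)).

    Assumption: a Bradley-Terry style ranking loss; the scores stand in for
    the scalar rewards a preference model would assign to each response.
    """
    margin = chosen_score - rejected_score
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs):
    """Average the pairwise loss over (chosen_score, rejected_score) tuples."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    return sum(pairwise_preference_loss(c, r) for c, r in score_pairs) / len(score_pairs)
```

The loss shrinks as the chosen response is scored further above the rejected one, and equals log 2 when the two scores tie, so minimizing it pushes the model to rank chosen responses higher.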

Paper Structure (51 sections, 3 equations, 13 figures, 17 tables)

Figures (13)

  • Figure 1: Compared to directly finetuning reward models, CodePMP significantly improves the sample efficiency and capability of reward models, which in turn boosts the generator's (MetaMath-Mistral-7B) reasoning performance (Best-of-N accuracy) across both mathematical reasoning tasks (GSM8K and MATH) and logical reasoning tasks (ReClor and LogiQA2.0).
  • Figure 2: Overview of CodePMP. First, raw code collected from GitHub is cleaned and summarized into code prompts (descriptions). Then, a weak CodeLLM generates rejected responses while a stronger CodeLLM produces chosen responses. Finally, these millions of $\langle \textit{chosen}, \textit{rejected} \rangle$ pairs form the preference model pretraining dataset, enhancing both sample efficiency and performance for downstream reasoning tasks.
  • Figure 3: Best-of-N accuracy comparison: CodePMP-initialized models outperform baselines across various N values, showing superior ranking capabilities.
  • Figure 4: Sample efficiency comparison for 7B models: CodePMP-initialized reward models achieve higher Best-of-N accuracy at equivalent sample sizes, showing better data efficiency. Horizontal axis scales by $\sqrt{2}$. Green: with CodePMP; Blue: without CodePMP.
  • Figure 5: Scaling analysis of CodePMP for 7B models: more code-preference pairs consistently improve Best-of-N accuracy across reasoning tasks without diminishing returns. Horizontal axis scales by $\sqrt{2}$; gray dashed lines show baseline performance without CodePMP.
  • ...and 8 more figures
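
The Best-of-N accuracy reported in these figures can be sketched as follows: for each problem, a reward model scores N candidate answers from the generator, the top-scoring one is selected, and accuracy is the fraction of problems whose selected answer is correct. The data layout and the `score` callable below are assumptions for illustration; in practice `score` would wrap a trained reward model.

```python
from typing import Callable, List, Sequence, Tuple

# One candidate answer: (answer_text, is_correct). Hypothetical layout.
Candidate = Tuple[str, bool]


def best_of_n(candidates: Sequence[Candidate], score: Callable[[str], float]) -> Candidate:
    """Return the candidate whose text receives the highest score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda cand: score(cand[0]))


def best_of_n_accuracy(problems: List[Sequence[Candidate]],
                       score: Callable[[str], float]) -> float:
    """Fraction of problems where the top-scored candidate is correct."""
    if not problems:
        raise ValueError("problems must not be empty")
    hits = sum(1 for cands in problems if best_of_n(cands, score)[1])
    return hits / len(problems)


if __name__ == "__main__":
    # Toy scorer (answer length) standing in for a reward model.
    toy_problems = [
        [("a", False), ("abc", True)],
        [("abcd", False), ("ab", True)],
    ]
    print(best_of_n_accuracy(toy_problems, score=len))
```

With the toy length-based scorer, the first problem selects a correct answer and the second an incorrect one, so a better-ranking scorer would raise the accuracy on the same candidates.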