CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning
Huimu Yu, Xing Wu, Haotian Xu, Debing Zhang, Songlin Hu
TL;DR
CodePMP tackles the data bottleneck in reward model (RM) finetuning by pretraining a preference model on millions of synthesized code-preference pairs derived from public GitHub code. This improves RM sample efficiency and downstream reasoning performance on mathematical benchmarks (GSM8K, MATH) and logical benchmarks (ReClor, LogiQA2.0), and the gains generalize across model architectures. By jointly training RM and LM components on large-scale code-derived data, CodePMP reduces reliance on expensive human annotations while improving Best-of-N selection and overall reasoning ability. The work shows that scalable, code-based preference model pretraining (PMP) is practical for LLM reasoning tasks and points toward further refinements and broader RM applications.
Abstract
Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that uses a large corpus of code-preference pairs synthesized from publicly available high-quality source code. Pretraining preference models on these pairs improves the efficiency of RM finetuning. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0). The results consistently show significant improvements in the reasoning performance of LLMs and highlight the importance of scalable preference model pretraining for efficient reward modeling.
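As a rough illustration of two ideas named above, the sketch below shows a standard pairwise preference loss of the kind commonly used to train reward models, and Best-of-N selection with a scoring function. It is a minimal sketch under stated assumptions, not the authors' implementation: the loss form (negative log-sigmoid of the score gap) is the usual Bradley-Terry-style objective and is assumed here rather than taken from the text, and the names `pairwise_preference_loss`, `best_of_n`, and `score` are hypothetical.

```python
import math
from typing import Callable, Sequence


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed loss form: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the preferred sample receives the higher score.
    Computed in a numerically stable way for large score gaps.
    """
    diff = r_chosen - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest score for the given prompt.

    `score` stands in for a trained reward model (hypothetical interface).
    On ties, the earliest candidate is returned.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score(prompt, c))
```

In use, `score` would wrap a finetuned reward model, and `candidates` would be N responses sampled from the LLM for the same prompt. A better-calibrated reward model should make the selected response correct more often.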
