Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning
Yifang Chen, Shuohang Wang, Ziyi Yang, Hiteshi Sharma, Nikos Karampatziakis, Donghan Yu, Kevin Jamieson, Simon Shaolei Du, Yelong Shen
TL;DR
This work tackles the data bottleneck in reinforcement learning from human feedback (RLHF) by proposing a cost-effective proxy reward oracle constructed from a small seed dataset and a limited expert budget. It introduces an on-policy query framework paired with active learning to generate and label far more preference data, enabling meaningful RLHF improvements with minimal queries. The method trains a weak evaluation model on a small evaluation fine-tuning (EFT) set, uses it to label responses to unlabeled prompts, forms Direct Preference Optimization (DPO) pairs from this on-policy data, and updates the policy accordingly, achieving measurable gains on the AlpacaEval2 and MMLU benchmarks. The results demonstrate that an on-policy, budget-aware strategy can outperform off-policy and SPIN baselines, with practical implications for reducing annotation costs in large-scale RLHF systems.
Abstract
Reinforcement learning from human feedback (RLHF), a widely adopted approach in current large language model pipelines, is bottlenecked by the size of human preference data. While traditional methods rely on offline preference dataset construction, recent approaches have shifted towards online settings, where a learner uses a small amount of labeled seed data and a large pool of unlabeled prompts to iteratively construct new preference data through self-generated responses and high-quality reward/preference feedback. However, most current online algorithms still focus on preference labeling during policy model updating with given feedback oracles, which incurs significant expert query costs. We are the first to explore cost-effective strategies for constructing proxy reward oracles that can then label preferences or rewards, given extremely limited labeled data and expert query budgets. Our approach introduces two key innovations: (1) on-policy querying to avoid out-of-distribution (OOD) and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries. Using these methods, we train an evaluation model with minimal expert-labeled data, which then effectively labels nine times more preference pairs for further RLHF training. For instance, our model using Direct Preference Optimization (DPO) gains over 1% on average across AlpacaEval2, MMLU-5shot and MMLU-0shot, with a query cost of only 1.7K. Our methodology is orthogonal to other direct expert query-based strategies and therefore might be integrated with them to further reduce query costs.
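
As an illustration of the labeling step described above, the following is a minimal Python sketch of how a trained proxy evaluation model might turn on-policy responses into DPO preference pairs. It is not the authors' implementation: the names `build_dpo_pairs`, `generate`, `proxy_score`, `num_samples`, and `min_gap` are hypothetical, and the rule of pairing the highest-scored response against the lowest-scored one is an assumption chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    """One DPO training example: a prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def build_dpo_pairs(
    prompts: Sequence[str],
    generate: Callable[[str, int], List[str]],   # hypothetical: samples responses from the current policy
    proxy_score: Callable[[str, str], float],    # hypothetical: proxy evaluation model's score
    num_samples: int = 4,
    min_gap: float = 0.0,
) -> List[PreferencePair]:
    """Label on-policy responses with the proxy model and keep best-vs-worst pairs."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        responses = generate(prompt, num_samples)
        if len(responses) < 2:
            continue  # a preference pair needs at least two candidates
        # Sort candidates by proxy score, lowest first.
        scored = sorted(((proxy_score(prompt, r), r) for r in responses), key=lambda t: t[0])
        low_score, worst = scored[0]
        high_score, best = scored[-1]
        # Assumption: skip prompts where the proxy model cannot separate the candidates.
        if high_score - low_score > min_gap:
            pairs.append(PreferencePair(prompt=prompt, chosen=best, rejected=worst))
    return pairs
```

In this sketch, expert queries would be spent only on training the model behind `proxy_score`. The resulting pairs could then be passed to any standard DPO trainer.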
