Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning

Yifang Chen, Shuohang Wang, Ziyi Yang, Hiteshi Sharma, Nikos Karampatziakis, Donghan Yu, Kevin Jamieson, Simon Shaolei Du, Yelong Shen

TL;DR

This work tackles the data bottleneck in reinforcement learning from human feedback (RLHF) by proposing a cost-effective proxy reward oracle constructed from a small seed dataset and a limited expert budget. It introduces an on-policy query framework paired with active learning to generate and label far more preference data, enabling meaningful RLHF improvements with minimal queries. The method trains a weak evaluation model on a small EFT set to label unlabeled prompts, forms Direct Preference Optimization (DPO) pairs from on-policy data, and updates the policy accordingly, achieving measurable gains across AlpacaEval2 and MMLU benchmarks. The results demonstrate that an on-policy, budget-aware strategy can outperform off-policy and SPIN baselines, with practical implications for reducing annotation costs in large-scale RLHF systems.

Abstract

Reinforcement learning with human feedback (RLHF), a widely adopted approach in current large language model pipelines, is bottlenecked by the size of human preference data. While traditional methods rely on offline preference dataset construction, recent approaches have shifted towards online settings, where a learner uses a small amount of labeled seed data and a large pool of unlabeled prompts to iteratively construct new preference data through self-generated responses and high-quality reward/preference feedback. However, most current online algorithms still focus on preference labeling during policy model updating with given feedback oracles, which incurs significant expert query costs. We are the first to explore cost-effective proxy reward oracle construction strategies for further labeling preferences or rewards with extremely limited labeled data and expert query budgets. Our approach introduces two key innovations: (1) on-policy query to avoid out-of-distribution (OOD) and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries. Using these methods, we train an evaluation model with minimal expert-labeled data, which then effectively labels nine times more preference pairs for further RLHF training. For instance, our model using Direct Preference Optimization (DPO) gains over 1% average improvement on AlpacaEval2, MMLU-5shot and MMLU-0shot, with only 1.7K query cost. Our methodology is orthogonal to other direct expert query-based strategies and therefore might be integrated with them to further reduce query costs.
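
The labeling step described in the abstract, where a proxy evaluation model scores self-generated responses and the scores are turned into preference pairs for DPO, can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `build_dpo_pairs`, the `generate` and `score` callables, the sample count `k`, and the `min_gap` threshold. The best-versus-worst pairing rule is likewise an assumption; the abstract only states that the evaluation model labels preference pairs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_dpo_pairs(
    prompts: Sequence[str],
    generate: Callable[[str, int], List[str]],  # stand-in for sampling from the policy
    score: Callable[[str, str], float],         # stand-in for the proxy evaluation model
    k: int = 4,
    min_gap: float = 0.0,
) -> List[PreferencePair]:
    """Pair the highest- and lowest-scored on-policy response for each prompt."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        responses = generate(prompt, k)
        if len(responses) < 2:
            continue  # a preference pair needs two candidates
        scored = sorted(((score(prompt, r), r) for r in responses), key=lambda t: t[0])
        (low_score, low), (high_score, high) = scored[0], scored[-1]
        # Skip prompts where the proxy cannot separate the candidates.
        if high_score - low_score > min_gap:
            pairs.append(PreferencePair(prompt, chosen=high, rejected=low))
    return pairs


# Toy stand-ins (assumptions for illustration only): longer responses score higher.
def toy_generate(prompt: str, k: int) -> List[str]:
    return [prompt + "!" * i for i in range(k)]


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


pairs = build_dpo_pairs(["hi"], toy_generate, toy_score, k=3)
print(pairs)
```

In a real pipeline the two callables would wrap the current policy and the trained evaluation model; the resulting `prompt`/`chosen`/`rejected` triples match the shape that common DPO training code expects.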

Paper Structure (58 sections, 1 equation, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Our cost-effective proxy reward oracle construction pipeline. Our main approach, shown as On-policy+AL, features two innovations: (1) an on-policy query framework that uses $M_1$-generated data to query preferences and train the evaluation model $M^\text{eval}$, and (2) an active learning (AL) module that further aids in selecting an informative subset of $n \ll N$ data points within the query budget. We also test an Off-policy method, adapted from the self-rewarding LM yuan2024selfreward. Unlike our on-policy query method, this approach queries the expert with seed SFT data and is generally outperformed by On-policy+AL except under benign conditions. Note that our experiments build upon DPO training, but the proxy oracle itself is independent of the RLHF training method.
  • Figure 2: With fixed query budget $n$, performance of $M^\text{eval}$ across different numbers of unlabeled prompts $N$. Left: The initial 1700 responses of $\bm{X}$ generated by $M_1$ are evaluated by GPT (indicated by the black vertical line). We use 1500 of these EFT data to train weak evaluators (shown in gray) and reserve the remainder as validation data to select the optimal weak evaluator. The graph displays the performance of models trained on preference sets labeled by this weak evaluator across five metrics. Right: Similar to the left, but instead of using the entire 1500 EFT data to train the weak evaluator, we select a balanced subset of 200 EFT as previously described.
  • Figure 3: Training reward distribution for $\text{EFT}_\text{seed}$ versus $\text{EFT}_1$ in our experiment, highlighting the bias towards higher rewards in $\text{EFT}_\text{seed}$.
  • Figure 4: Previous preference data labeling pipelines. The figure depicts two methods, direct query and SPIN, neither of which requires a proxy reward oracle. Direct query therefore demands a high budget, while SPIN is strictly outperformed by our methods when $m$ is small.
  • Figure 5: $M_2$ performance vs. query budget. The shaded region represents the square root of the total variance.
  • ...and 2 more figures
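
The active learning module in Figure 1 selects $n \ll N$ points to send to the expert. The captions above do not state which selection rule is used, so the sketch below swaps in k-center greedy, a common diversity-based heuristic, purely as an illustration. The function name `k_center_greedy`, the `first` seed index, and the use of Euclidean distance over precomputed embeddings are all assumptions.

```python
import math
from typing import List, Sequence


def k_center_greedy(
    embeddings: Sequence[Sequence[float]],
    budget: int,
    first: int = 0,
) -> List[int]:
    """Pick `budget` indices so that the selected points cover the pool.

    Each step adds the point farthest from everything selected so far.
    """
    n_points = len(embeddings)
    if budget <= 0 or n_points == 0:
        return []
    budget = min(budget, n_points)

    selected = [first]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(e, embeddings[first]) for e in embeddings]

    while len(selected) < budget:
        idx = max(range(n_points), key=min_dist.__getitem__)
        selected.append(idx)
        for i, e in enumerate(embeddings):
            d = math.dist(e, embeddings[idx])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy pool of N = 4 one-dimensional points embedded in 2-D; query budget n = 3.
pool = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 0.0)]
chosen = k_center_greedy(pool, budget=3)
print(chosen)  # [0, 2, 3]
```

The near-duplicate point at index 1 is skipped because it adds little coverage, which is the intuition behind spending a small query budget on diverse examples.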