Pre-Trained Policy Discriminators are General Reward Models

Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, Songyang Gao, Chengqi Lv, Enyu Zhou, Honglin Guo, Zhiheng Xi, Wenwei Zhang, Qipeng Guo, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Tao Gui, Kai Chen

TL;DR

This work reframes reward modeling as policy discrimination and introduces POLAR, a two-stage framework that first pre-trains a reward model (RM) to distinguish identical from different policies via distributional alignment and a Bradley-Terry (BT) loss, then fine-tunes it with human criteria. The pre-trained RM supplies robust, criterion-agnostic reward signals that, when used in RLHF with Reinforcement Fine-Tuning (RFT), yield superior policy performance across multiple benchmarks and tasks. POLAR exhibits clear scaling laws: validation loss decreases predictably as model size and compute grow, and ablations show that pre-training is critical for RLHF success. The approach offers a scalable, generalizable path toward stronger reward models and more reliable RLHF in open-ended domains.

Abstract

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
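To make the pre-training objective more concrete, the Bradley-Terry (BT) loss named in the TL;DR can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names `bt_loss` and `batch_bt_loss` and the toy reward values are hypothetical. It assumes the RM has already produced one scalar reward for a trajectory pair drawn from the same policy (positive) and one for a pair drawn from different policies (negative).

```python
import math


def bt_loss(reward_pos: float, reward_neg: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(reward_pos - reward_neg)).

    reward_pos: RM score for a trajectory pair from the same policy (assumed input).
    reward_neg: RM score for a trajectory pair from different policies (assumed input).
    """
    diff = reward_pos - reward_neg
    # Numerically stable form of log(1 + exp(-diff)) for both signs of diff.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def batch_bt_loss(pairs):
    """Mean BT loss over a list of (reward_pos, reward_neg) tuples."""
    return sum(bt_loss(pos, neg) for pos, neg in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical rewards, chosen only to show how the loss behaves.
    toy_pairs = [(1.5, -0.3), (0.2, 0.1), (-0.4, 0.9)]
    print(batch_bt_loss(toy_pairs))
```

The loss shrinks as the same-policy pair is scored further above the different-policy pair, which is the ranking behavior the pre-training stage is described as encouraging.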

Paper Structure

This paper contains 54 sections, 13 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Comparison of three reward modeling methods: (1) traditional methods incorporate absolute preferences into RMs, which directly assess the quality of trajectories; (2) rule-based verifiers validate the candidate trajectory against the gold answer and predefined rules; (3) POLAR pre-trains an RM to recognize identical policies and discriminate different ones, enabling it to measure the difference in trajectories between a training policy and a target policy with desired behaviors.
  • Figure 2: Overview of Policy Discriminative Learning (POLAR). Stage 1: In pre-training, the RM learns criterion-agnostic policy differences by assigning higher rewards to trajectory pairs from consistent policies. Stage 2: During fine-tuning, human annotators rank trajectories from the same policy, implicitly defining human criteria, to align RM evaluations with human standards. Usage: In Reinforcement Fine-Tuning (RFT), the fine-tuned RM provides reward signals comparing candidate trajectories with human-preferred references, guiding policy training toward desired behaviors.
  • Figure 3: Comparison of POLAR and baselines on human preference prediction.
  • Figure 4: Scaling laws in POLAR. Validation loss vs. (left) model parameters $N$ and (right) optimal training compute $C$. Dashed lines show the power-law fit, with $R^2 = 0.9886$ (left) and $R^2 = 0.9912$ (right). Results show a predictable decrease in validation loss as model size or compute increases.
  • Figure 5: Scaling law of the learning rate with respect to model size and data scale in pre-training.
  • ...and 5 more figures
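
The power-law fits mentioned in the Figure 4 caption can be illustrated with a small sketch. A power law $L = a \cdot x^{b}$ becomes a straight line in log-log space, so one common way to fit it is ordinary least squares on $(\log x, \log L)$, with $R^2$ computed for that line. The fitting procedure actually used in the paper is not stated in this summary, so the log-log approach, the function name `fit_power_law`, and the synthetic data below are all assumptions made for illustration; none of the numbers come from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y).

    Returns (a, b, r_squared), where r_squared is measured in log-log space.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    syy = sum((v - mean_y) ** 2 for v in log_y)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                        # slope = power-law exponent
    a = math.exp(mean_y - b * mean_x)    # intercept maps back to the prefactor
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared


if __name__ == "__main__":
    # Synthetic (not from the paper): loss = 2.0 * size**-0.1, no noise.
    sizes = [1e7, 1e8, 1e9, 1e10]
    losses = [2.0 * s ** -0.1 for s in sizes]
    a, b, r2 = fit_power_law(sizes, losses)
    print(f"a={a:.4f}, b={b:.4f}, R^2={r2:.6f}")
```

A negative exponent `b` corresponds to loss decreasing as the scale variable grows, and an $R^2$ close to 1 indicates that the points lie nearly on a straight line in log-log space.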