GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning

Chenglong Wang, Yongyu Mu, Hang Zhou, Yifu Huo, Ziming Zhu, Jiali Zeng, Murun Yang, Bei Li, Xiaoyang Hao, Chunliang Zhang, Fandong Meng, Jingbo Zhu, Tong Xiao

TL;DR

GRAM-R$^2$ addresses the data bottleneck in reward modeling by introducing a self-training generative foundation reward model that outputs both preference labels and explicit reward rationales. It trains a dedicated preference-proving model to generate proofs and uses iterative self-training on unlabeled data to scale reward reasoning, enabling strong performance across response ranking, task adaptation, and RLHF with minimal task-specific fine-tuning. The approach outperforms discriminative and generative baselines, demonstrates robustness to best-of-$n$ sampling, and achieves notable data-efficient adaptation (e.g., 1K STEM data yielding large gains). This work shows that explicit reward reasoning can be learned from rationale-free labeled data and unlabeled data, providing a scalable path to generalist reward models for aligning LLMs with human preferences.

Abstract

Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.

Paper Structure

This paper contains 56 sections, 12 equations, 15 figures, 9 tables, 1 algorithm.

Figures (15)

  • Figure 1: Architecture of the Generative Reward Model. The generative reward model utilizes a pre-trained LLM to predict a preference label from a given prompt directly. Optionally, it can incorporate reward reasoning before generating the final preference label prediction.
  • Figure 2: An overview of the self-training approach for GRAM-R$^2$. The process begins by training a preference-proving model on a small, rationale-based seed dataset of approximately 40.5K examples. This model is then used to synthesize rationales for a larger, rationale-free labeled dataset of 1M examples, which in turn is used to train the initial GRAM-R$^2$ model. Subsequently, GRAM-R$^2$ undergoes three iterations of self-training, using a new batch of 0.5M unlabeled examples in each iteration.
  • Figure 3: Best-of-$n$ sampling performance curves for GRAM-R$^2$ and strong baseline models on the PPE benchmark. "D-Baseline" and "G-Baseline" refer to discriminative and generative reward models, respectively, trained on the same labeled preference data. "Ground Truth" represents an oracle reward model that selects responses based on gold-truth answers. All results are reported using the LLaMA-3.1-8B-Instruct backbone.
  • Figure 4: The performance of reward models fine-tuned with varying amounts of task-specific data (STEM and code generation).
  • Figure 5: Performance scaling with different amounts of training data used to pre-train GRAM-R$^2$. "0M" denotes the setting where GRAM-R$^2$ is trained solely during the fine-tuning stage, without any pre-training on rationale-free labeled data or unlabeled data. RFD: Rationale-Free Labeled Data; UD: Unlabeled Data.
  • ...and 10 more figures
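
The three stages described in the Figure 2 caption (train a preference-proving model on seed data, synthesize rationales for rationale-free labeled data, then run several self-training rounds on unlabeled batches) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the names `Example`, `Prover`, `RewardModel`, and `run_pipeline` are hypothetical, the trainers are passed in as plain callables, and accumulating pseudo-labeled data across rounds is an assumption the caption does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str, str]  # (prompt, response_a, response_b)


@dataclass(frozen=True)
class Example:
    pair: Pair
    label: str      # "A" or "B"
    rationale: str  # free-text justification for the label


# Hypothetical interfaces (assumptions, not taken from the paper):
# a prover maps (pair, known label) -> rationale,
# a reward model maps pair -> (rationale, predicted label).
Prover = Callable[[Pair, str], str]
RewardModel = Callable[[Pair], Tuple[str, str]]


def run_pipeline(
    seed: Sequence[Example],
    rationale_free: Sequence[Tuple[Pair, str]],
    unlabeled_batches: Sequence[Sequence[Pair]],
    train_prover: Callable[[Sequence[Example]], Prover],
    train_reward_model: Callable[[Sequence[Example]], RewardModel],
) -> Tuple[RewardModel, List[int]]:
    """Return the final reward model and the training-set size per round."""
    # Stage 1: fit the preference-proving model on the rationale-based seed set.
    prover = train_prover(seed)

    # Stage 2: synthesize rationales for labeled but rationale-free pairs,
    # then train the initial reward model on the result.
    data = [Example(p, y, prover(p, y)) for p, y in rationale_free]
    model = train_reward_model(data)
    sizes = [len(data)]

    # Stage 3: self-training. Each round pseudo-labels a fresh unlabeled batch
    # with the current model and retrains. Keeping earlier data in the pool is
    # an assumption made for this sketch.
    for batch in unlabeled_batches:
        pseudo = []
        for p in batch:
            rationale, label = model(p)
            pseudo.append(Example(p, label, rationale))
        data = data + pseudo
        model = train_reward_model(data)
        sizes.append(len(data))
    return model, sizes
```

With the sizes quoted in the caption, `seed` would hold about 40.5K examples, `rationale_free` 1M pairs, and `unlabeled_batches` three batches of 0.5M pairs each. A real system would likely also filter low-quality pseudo-labels before retraining, which this sketch omits.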