GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning

Chenglong Wang; Yongyu Mu; Hang Zhou; Yifu Huo; Ziming Zhu; Jiali Zeng; Murun Yang; Bei Li; Xiaoyang Hao; Chunliang Zhang; Fandong Meng; Jingbo Zhu; Tong Xiao

TL;DR

GRAM-R$^2$ addresses the labeled-data bottleneck in reward modeling with a self-trained generative foundation reward model that outputs both preference labels and explicit reward rationales. It trains a dedicated preference-proving model to generate proofs and uses iterative self-training on unlabeled data to scale reward reasoning, which yields strong performance on response ranking, task adaptation, and RLHF with minimal task-specific fine-tuning. The approach outperforms discriminative and generative baselines, remains robust under best-of-$n$ sampling, and adapts data-efficiently (e.g., 1K STEM data yields large gains). The work shows that explicit reward reasoning can be learned from rationale-free labeled data together with unlabeled data, providing a scalable path to generalist reward models for aligning LLMs with human preferences.

Abstract

Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models still faces a fundamental challenge: heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
