Scaling Reward Modeling without Human Supervision

Jingxuan Fan, Yueying Li, Zhenting Qi, Dinghuai Zhang, Kianté Brantley, Sham M. Kakade, Hanlin Zhang

TL;DR

This work demonstrates the feasibility and promise of training reward models without costly and potentially unreliable human annotations. It operationalizes reward-based scaling, in its simplest form, as preference learning over document prefixes and suffixes drawn from large-scale web corpora.

Abstract

Learning from feedback is instrumental to advancing the capabilities and safety of frontier models, yet its effectiveness is often constrained by cost and scalability. We present a pilot study that explores scaling reward models through unsupervised approaches. We operationalize reward-based scaling (RBS), in its simplest form, as preference learning over document prefixes and suffixes drawn from large-scale web corpora. The approach shows several advantages: despite using no human annotations, training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2, and these improvements transfer consistently across diverse initialization backbones spanning model families and scales. Across models, our method improves RewardBench v2 accuracy by up to +7.7 points on average, with gains of up to +16.1 on in-domain math subsets and consistent improvements on out-of-domain safety and general subsets. When applied to best-of-N selection and policy optimization, these reward models substantially improve downstream math performance and match or exceed strong supervised reward model baselines of similar size. Overall, we demonstrate the feasibility and promise of training reward models without costly and potentially unreliable human annotations.
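The best-of-N selection mentioned in the abstract can be illustrated with a minimal sketch: score each candidate response with a reward function and keep the highest-scoring one. The names `best_of_n` and `toy_reward` are hypothetical, and the toy scorer is only a stand-in for a trained reward model; the abstract does not specify how candidates are sampled or what value of N is used.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reward for the given prompt.

    reward_fn(prompt, candidate) is assumed to return a scalar score;
    in practice it would wrap a trained reward model.
    """
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    # max() keeps the first candidate among ties.
    return max(candidates, key=lambda c: reward_fn(prompt, c))


def toy_reward(prompt: str, candidate: str) -> float:
    """Hypothetical stand-in scorer: rewards answers that end in '4'."""
    return 1.0 if candidate.strip().endswith("4") else 0.0


if __name__ == "__main__":
    answers = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
    print(best_of_n("What is 2 + 2?", answers, toy_reward))
```

Because selection only needs a ranking over candidates, any reward function that orders responses sensibly can be plugged in without changing the policy that generated them.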

Paper Structure (36 sections, 12 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: Schematic overview of our reward model training workflow from web math text. Raw documents are split into prefix–suffix pairs, where the true continuation is treated as the chosen response and other in-batch continuations serve as implicit negatives. The reward model is trained with a Bradley–Terry objective over these online preference pairs, enabling scalable reward learning without human annotations.
  • Figure 2: Reward Model Training from Web Data
  • Figure 3: Scalability of our method with respect to data size. Reward models trained from scratch on Llama-3.2-3B improve steadily on all RewardBench v1/v2 subsets as the token budget increases to 11M.
  • Figure 4: Effect of batch size on peak gains and learning trajectory on RewardBench v2.
  • Figure 5: Effect of dataset quality on peak gains and learning trajectory on RewardBench v2.
  • ...and 8 more figures
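
The workflow described in the Figure 1 caption can be sketched as follows: each document is split into a prefix and a suffix, the true suffix is treated as the chosen response, and the suffixes of the other documents in the batch act as rejected responses under a Bradley–Terry objective. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the midpoint split is an illustrative choice (the caption does not say how the split point is picked), and the score matrix is assumed to come from some reward model that is not shown here.

```python
import math


def split_document(tokens, cut=None):
    """Split a token list into (prefix, suffix).

    Assumption: split at the midpoint unless `cut` is given.
    """
    if cut is None:
        cut = len(tokens) // 2
    return tokens[:cut], tokens[cut:]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def in_batch_bt_loss(scores):
    """Mean Bradley-Terry loss with in-batch negatives.

    scores[i][j] is the reward assigned to (prefix_i, suffix_j).
    The diagonal holds the true continuations (chosen); every
    off-diagonal entry in row i is an implicit negative (rejected).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two documents for in-batch negatives")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            margin = scores[i][i] - scores[i][j]
            # -log(sigmoid(margin)) equals softplus(-margin).
            total += softplus(-margin)
    return total / (n * (n - 1))
```

With an uninformative reward model (all scores equal) the loss is log 2 per pair, and it approaches zero as the true continuation is scored increasingly higher than the in-batch negatives.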