Semi-Supervised Reward Modeling via Iterative Self-Training

Yifei He; Haoxiang Wang; Ziyan Jiang; Alexandros Papangelis; Han Zhao

TL;DR

Semi-Supervised Reward Modeling (SSRM) addresses the high data cost of training reward models for RLHF by leveraging unlabeled data through an iterative loop of pseudo-labeling, confidence thresholding, and supervised finetuning. Starting from a small labeled set, SSRM trains an initial reward model, then augments the data with high-confidence pseudo-labels from a larger unlabeled pool and refines the model with the supervised reward modeling (SRM) objective $\ell_{SRM}(\pi_\theta) = -\mathbb{E}_{(x,a_1,a_2,y)}[\log \pi_\theta(y|\mathbb{T}(x,a_1,a_2))]$. This process yields substantial gains across 0.4B–8B models and often approaches fully supervised performance with only a fraction of the labeled data. Empirical results on RewardBench show improved calibration and higher confidence for correct predictions, and the improved reward models lead to better policy performance in downstream alignment (e.g., DPO). SSRM thus offers a cost-effective and scalable pathway to high-quality reward models, broadening access to effective RLHF across model sizes.
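As a concrete reading of the SRM objective, the sketch below computes the empirical version of $-\mathbb{E}[\log \pi_\theta(y|\mathbb{T}(x,a_1,a_2))]$ as a mean negative log-likelihood. It assumes the model has already produced, for each example, the probability it assigns to the observed label $y$; the function name `srm_loss` is hypothetical and the model forward pass and the formatting $\mathbb{T}$ are deliberately left out.

```python
import math
from typing import Sequence


def srm_loss(probs_of_label: Sequence[float]) -> float:
    """Mean negative log-likelihood over a batch.

    probs_of_label[i] is assumed to be pi_theta(y_i | T(x_i, a1_i, a2_i)),
    i.e. the probability the model assigns to the observed preference label.
    """
    if not probs_of_label:
        raise ValueError("need at least one example")
    for p in probs_of_label:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"probability out of range (0, 1]: {p}")
    return -sum(math.log(p) for p in probs_of_label) / len(probs_of_label)


# A model that is indifferent between the two responses (p = 0.5) pays log 2
# per example; a model that is certain and correct pays nothing.
uniform_loss = srm_loss([0.5, 0.5])
perfect_loss = srm_loss([1.0])
```

The loss falls toward zero as the model puts more probability on the labeled preference, which is why the same objective can be reused unchanged once pseudo-labels are added to the training set.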

Abstract

Reward models (RM) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM can achieve performance comparable to models trained entirely on labeled data of equivalent volumes. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
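The three iterative steps described in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` interface (a callable returning the probability that the second response is preferred), the `finetune` callable, the threshold of 0.9, the iteration count, and the choice to remove accepted examples from the unlabeled pool are all hypothetical. The toy `toy_finetune` below is a stand-in that ignores its training data entirely.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str, str]               # (prompt x, response a1, response a2)
LabeledPair = Tuple[str, str, str, int]   # pair plus label y: 0 if a1 preferred, 1 if a2
Model = Callable[[Pair], float]           # hypothetical interface: returns P(y = 1)
Finetune = Callable[[Optional[Model], List[LabeledPair]], Model]


def pseudo_label(model: Model, pair: Pair) -> Tuple[int, float]:
    """Return the predicted label and its confidence max(p, 1 - p)."""
    p = model(pair)
    return (1 if p >= 0.5 else 0), max(p, 1.0 - p)


def ssrm(finetune: Finetune, labeled: List[LabeledPair], unlabeled: List[Pair],
         threshold: float = 0.9, iterations: int = 3):
    model = finetune(None, labeled)        # supervised training on the labeled set
    data, pool = list(labeled), list(unlabeled)
    for _ in range(iterations):
        confident, remaining = [], []
        for pair in pool:                  # (i) pseudo-labeling
            label, conf = pseudo_label(model, pair)
            if conf > threshold:           # (ii) confidence thresholding
                confident.append((*pair, label))
            else:
                remaining.append(pair)
        if not confident:                  # nothing new to learn from
            break
        data.extend(confident)
        pool = remaining                   # assumption: accepted pairs leave the pool
        model = finetune(model, data)      # (iii) finetune on the augmented data
    return model, data


def toy_finetune(model: Optional[Model], data: List[LabeledPair]) -> Model:
    # Stand-in "training": ignores the data and always prefers the longer response.
    def predict(pair: Pair) -> float:
        _, a1, a2 = pair
        return len(a2) / (len(a1) + len(a2))
    return predict


labeled = [("q0", "short", "a much longer answer", 1)]
unlabeled = [("q1", "a", "b" * 19), ("q2", "aa", "bb")]
model, augmented = ssrm(toy_finetune, labeled, unlabeled)
```

In this toy run only the `q1` pair clears the threshold (confidence 0.95), so it is added with a pseudo-label, while the ambiguous `q2` pair (confidence 0.5) stays unlabeled.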

Paper Structure (34 sections, 4 equations, 4 figures, 8 tables, 1 algorithm)

Introduction
Semi-Supervised Reward Modeling
Reward Model
Iterative Self-Training
Supervised training
Pseudo-labeling
Confidence thresholding
Model update
Experiments
Setup
Models
Datasets
Data Splitting
Evaluation
Benchmark Evaluation
...and 19 more sections

Figures (4)

Figure 1: Semi-Supervised Reward Modeling (SSRM) enhances the ability of a language model to predict preferences using both labeled and unlabeled data. Given a pretrained model $\pi_\theta^\text{pre}$, a small labeled dataset $D_l$ and a large unlabeled dataset $D_u$, we first perform supervised reward modeling (SRM) on $D_l$ to obtain the SRM model $\pi_\theta^0$. Then, at each step $t$, we perform three steps: (i) Pseudo-labeling: assign pseudo-labels to examples in $D_u$. (ii) Confidence thresholding: given a prompt $x$ and two responses $a_1, a_2$, if the prediction confidence exceeds a preset threshold, append it to the labeled dataset to obtain $D_t$. (iii) SRM on augmented data: finetune the model on $D_t$.
Figure 2: After three iterations of SSRM, Gemma-2B demonstrates better calibration, especially in the high-confidence score range, showing the effectiveness of confidence thresholding.
Figure 3: The prediction confidence of Gemma-2B models noticeably improves after three iterations of SSRM. Combined with better calibration, this shows that the predictions more accurately reflect the actual outcomes.
Figure 4: With more labeled data, the performance of SSRM consistently increases.
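The calibration comparison in Figures 2 and 3 can be illustrated with a simple reliability table: predictions are grouped by confidence, and each group's mean confidence is compared with its accuracy. The sketch below is one common way to do this and is an assumption, not the procedure used to produce the figures; the function name `reliability_bins`, the equal-width binning over [0.5, 1.0], and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def reliability_bins(confidences: Sequence[float], correct: Sequence[bool],
                     n_bins: int = 5) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty bin.

    Confidence is assumed to be max(p, 1 - p) for a binary preference
    prediction, so it lies in [0.5, 1.0].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for c, ok in zip(confidences, correct):
        if not 0.5 <= c <= 1.0:
            raise ValueError(f"confidence out of range [0.5, 1.0]: {c}")
        # Equal-width bins over [0.5, 1.0]; c == 1.0 falls in the last bin.
        i = min(int((c - 0.5) / 0.5 * n_bins), n_bins - 1)
        sums[i] += c
        hits[i] += int(ok)
        counts[i] += 1
    return [(sums[i] / counts[i], hits[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i] > 0]


# Made-up values for illustration only.
bins = reliability_bins([0.55, 0.95, 0.97], [False, True, True])
```

A well-calibrated model has accuracy close to mean confidence in every bin; the high-confidence bins matter most for SSRM because only those predictions pass the threshold and become pseudo-labels.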
