Semi-Supervised Reward Modeling via Iterative Self-Training
Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, Han Zhao
TL;DR
Semi-Supervised Reward Modeling (SSRM) addresses the high data cost of training reward models for RLHF by leveraging unlabeled data through an iterative loop of pseudo-labeling, confidence thresholding, and supervised finetuning. Starting from a small labeled set, SSRM trains an initial reward model, then augments the data with high-confidence pseudo-labels from a larger unlabeled pool and refines the model with an SRM objective $\ell_{SRM}(\pi_\theta) = -\mathbb{E}_{(x,a_1,a_2,y)}[\log \pi_\theta(y|\mathbb{T}(x,a_1,a_2))]$. This process yields substantial gains across 0.4B–8B models and often approaches fully supervised performance with only a fraction of the labeled data. Empirical results on RewardBench show improved calibration and higher confidence on correct predictions, and downstream alignment (e.g., with DPO) yields better policy performance. SSRM thus offers a cost-effective and scalable path to high-quality reward models, broadening access to effective RLHF across model sizes.
Abstract
Reward models (RM) capture the values and preferences of humans and play a central role in Reinforcement Learning with Human Feedback (RLHF) to align pretrained large language models (LLMs). Traditionally, training these models relies on extensive human-annotated preference data, which poses significant challenges in terms of scalability and cost. To overcome these limitations, we propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. Given an unlabeled dataset, SSRM involves three key iterative steps: pseudo-labeling unlabeled examples, selecting high-confidence examples through a confidence threshold, and supervised finetuning on the refined dataset. Across extensive experiments on various model configurations, we demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Notably, SSRM can achieve performance comparable to models trained entirely on labeled data of equivalent volumes. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
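The three iterative steps (pseudo-labeling, confidence thresholding, supervised finetuning) can be illustrated with a minimal sketch. It is not the authors' implementation: the LLM-based reward model is replaced here by a toy linear preference model over hand-made feature vectors, and the names `train`, `predict`, `ssrm`, the label convention (`1` means the first response is preferred), the default `threshold=0.9`, and the choice to remove accepted examples from the unlabeled pool are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, pair):
    # Probability that the first response of the pair is preferred.
    a1, a2 = pair
    return sigmoid(sum(wi * (u - v) for wi, u, v in zip(w, a1, a2)))


def train(pairs, labels, dim, epochs=200, lr=0.5):
    # Toy stand-in for supervised finetuning: gradient descent on the
    # negative log-likelihood of the preference label.
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for (a1, a2), y in zip(pairs, labels):
            p = predict(w, (a1, a2))
            for i in range(dim):
                grad[i] += (p - y) * (a1[i] - a2[i])
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w


def ssrm(labeled_pairs, labels, unlabeled_pairs, dim, threshold=0.9, rounds=3):
    # Step 0: initial model from the small labeled set.
    w = train(labeled_pairs, labels, dim)
    pairs, ys = list(labeled_pairs), list(labels)
    pool = list(unlabeled_pairs)
    for _ in range(rounds):
        kept, remaining = [], []
        for pair in pool:
            p = predict(w, pair)              # step 1: pseudo-label
            confidence = max(p, 1.0 - p)
            if confidence >= threshold:       # step 2: confidence filter
                kept.append((pair, 1 if p >= 0.5 else 0))
            else:
                remaining.append(pair)
        if not kept:
            break  # nothing confident enough to add
        pairs += [pair for pair, _ in kept]
        ys += [y for _, y in kept]
        pool = remaining
        w = train(pairs, ys, dim)             # step 3: retrain on augmented set
    return w, len(pairs) - len(labeled_pairs)


# Synthetic demo: preferences generated by a hidden linear scorer.
rng = random.Random(0)
hidden = [1.5, -1.0, 0.5]


def make_pairs(n):
    return [([rng.gauss(0, 1) for _ in hidden], [rng.gauss(0, 1) for _ in hidden])
            for _ in range(n)]


def true_label(pair):
    a1, a2 = pair
    return 1 if sum(h * (u - v) for h, u, v in zip(hidden, a1, a2)) > 0 else 0


labeled = make_pairs(20)
unlabeled = make_pairs(200)
weights, n_pseudo = ssrm(labeled, [true_label(p) for p in labeled], unlabeled, dim=3)
```

In this sketch, raising `threshold` admits fewer but more reliable pseudo-labels per round, while lowering it grows the training set faster at the risk of reinforcing the initial model's mistakes.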
