Provably Efficient Online RLHF with One-Pass Reward Modeling
Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou
TL;DR
The paper addresses the computational bottleneck of online RLHF by formulating RLHF as a contextual dueling (preference) bandit with linear rewards and replacing maximum likelihood estimation (MLE) of the reward model with a one-pass method based on implicit online mirror descent. Each iteration uses a constant-time update and requires no stored historical data, while retaining high-probability guarantees and improving on the estimation bounds of MLE. The method is instantiated for passive, active, and deployment-time RLHF, each with a tailored algorithm and regret or suboptimality guarantees; the practical implementations use Hessian-vector products and rejection sampling for uncertainty. Experiments on Llama-3-8B-Instruct and Qwen2.5-7B-Instruct with the Ultrafeedback and Mixture2 datasets show improved statistical efficiency and substantial computational savings, supporting the approach for scalable, real-time alignment.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has shown remarkable success in aligning Large Language Models (LLMs) with human preferences. Traditional RLHF methods rely on a fixed dataset, which often suffers from limited coverage. To address this limitation, online RLHF has emerged as a promising direction, enabling iterative data collection and refinement. Despite its potential, this paradigm faces a key bottleneck: the requirement to continuously integrate new data into the dataset and re-optimize the model from scratch at each iteration, resulting in computational and storage costs that grow linearly with the number of iterations. In this work, we address this challenge by proposing a one-pass reward modeling method that eliminates the need to store historical data and achieves constant-time updates per iteration. Specifically, we first formalize RLHF as a contextual preference bandit and develop a new algorithm based on online mirror descent with a tailored local norm, replacing the standard maximum likelihood estimation for reward modeling. We then apply it to various online RLHF settings, including passive data collection, active data collection, and deployment-time adaptation. We provide theoretical guarantees showing that our method enhances both statistical and computational efficiency. Finally, we design practical algorithms for LLMs and conduct experiments with the Llama-3-8B-Instruct and Qwen2.5-7B-Instruct models on the Ultrafeedback and Mixture2 datasets, validating the effectiveness of our approach.
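To make the "one-pass, constant-time update" idea concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a linear Bradley-Terry reward model and swaps the implicit online mirror descent step for a simpler explicit online Newton-style step, where the local norm is given by a running curvature matrix H maintained through the Sherman-Morrison formula. The names OnePassRewardModel, update, and simulate, the step size lr, the regularizer reg, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnePassRewardModel:
    """Illustrative linear reward model updated from one comparison at a time.

    Assumption: P(a preferred over b) = sigmoid(theta . (phi(a) - phi(b))).
    Only theta and the inverse curvature matrix are kept; no comparison is stored.
    """

    def __init__(self, dim, lr=1.0, reg=1.0):
        self.theta = [0.0] * dim
        # Inverse of H = reg * I (the initial local-norm matrix).
        self.h_inv = [[(1.0 / reg if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.lr = lr

    def _h_inv_times(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.h_inv]

    def update(self, x, y):
        """x: feature difference phi(a) - phi(b); y: 1 if a was preferred, else 0."""
        p = sigmoid(sum(t * xi for t, xi in zip(self.theta, x)))
        w = p * (1.0 - p)  # curvature of the logistic loss at the current theta
        # Sherman-Morrison update of the inverse of H + w * x x^T, in O(d^2).
        hx = self._h_inv_times(x)
        denom = 1.0 + w * sum(xi * hi for xi, hi in zip(x, hx))
        for i in range(len(x)):
            for j in range(len(x)):
                self.h_inv[i][j] -= w * hx[i] * hx[j] / denom
        # Preconditioned step: theta <- theta - lr * H^{-1} * (p - y) * x.
        step = self._h_inv_times(x)
        for i in range(len(x)):
            self.theta[i] -= self.lr * (p - y) * step[i]


def simulate(steps=2000, seed=0):
    """Stream synthetic comparisons once; return cosine(theta_hat, theta_true)."""
    rng = random.Random(seed)
    true_theta = [1.0, -1.0, 0.5]  # hypothetical ground-truth reward weights
    model = OnePassRewardModel(dim=3)
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        p_true = sigmoid(sum(t * xi for t, xi in zip(true_theta, x)))
        y = 1 if rng.random() < p_true else 0
        model.update(x, y)  # each comparison is seen exactly once
    dot = sum(a * b for a, b in zip(model.theta, true_theta))
    norm = (math.sqrt(sum(a * a for a in model.theta))
            * math.sqrt(sum(b * b for b in true_theta)))
    return dot / norm


if __name__ == "__main__":
    print("cosine similarity to true weights:", simulate())
```

In this sketch, each call to update costs O(d^2) time and memory regardless of how many comparisons have already been processed, in contrast to re-fitting an MLE over the full history at every iteration. At LLM scale an explicit d-by-d matrix is not feasible, which is presumably why the paper's practical implementation relies on Hessian-vector products instead.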
