Greedy Sampling Is Provably Efficient for RLHF
Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
TL;DR
The paper addresses RLHF under KL-regularized contextual bandits with preference feedback, extending theoretical guarantees to the general preference (GP) model and its Bradley-Terry (BT) special case. It introduces a Greedy Sampling algorithm that uses empirical estimates directly, exploiting KL-regularization to bound optimal policies relative to a reference policy. The authors prove online regret bounds of order $O(\exp(\eta) \times d(\bullet) \times \log(T))$ and offline sample complexities of $O(1/\epsilon)$ for both the GP and BT models, while avoiding confidence-bound constructions. Empirical results corroborate the theory, showing that greedy sampling achieves performance comparable to optimism-based methods with reduced computational overhead, suggesting practical efficiency for RLHF.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
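To make the structural property mentioned above concrete, the following is a minimal sketch for the reward-based (BT) setting only, not the paper's algorithm. It assumes the KL-regularized objective takes the common form $\mathbb{E}_{a \sim \pi}[r(a)] - \eta^{-1}\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$, whose maximizer is the Gibbs policy $\pi(a) \propto \pi_{\text{ref}}(a)\exp(\eta\, r(a))$. The function name `gibbs_policy`, the toy numbers, and the assumption that rewards lie in $[0, 1]$ are illustrative and do not come from the source. Under these assumptions, plugging in an empirical reward estimate (the greedy choice) yields a policy whose density ratio to the reference policy stays within $[\exp(-\eta), \exp(\eta)]$.

```python
import math

def gibbs_policy(pi_ref, reward_hat, eta):
    """Return pi(a) proportional to pi_ref(a) * exp(eta * reward_hat(a)).

    Hypothetical helper: assumes a single context, a finite action set,
    and strictly positive reference probabilities.
    """
    logits = [math.log(p) + eta * r for p, r in zip(pi_ref, reward_hat)]
    # Subtract the max logit for numerical stability; it cancels on normalization.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy example (made-up numbers): three actions, estimated rewards in [0, 1].
pi_ref = [0.5, 0.3, 0.2]
reward_hat = [0.1, 0.9, 0.4]
eta = 2.0

pi_greedy = gibbs_policy(pi_ref, reward_hat, eta)
ratios = [p / q for p, q in zip(pi_greedy, pi_ref)]

# With rewards in [0, 1], every ratio lies in [exp(-eta), exp(eta)].
print(pi_greedy)
print(ratios)
```

The ratio bound holds because the normalizing constant $\sum_a \pi_{\text{ref}}(a)\exp(\eta\, r(a))$ lies in $[1, \exp(\eta)]$ whenever $r(a) \in [0, 1]$, so the greedy policy can never move arbitrarily far from the reference policy.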
