Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference
Matteo Cercola, Valeria Capretti, Simone Formentin
TL;DR
This work addresses the data inefficiency of reinforcement learning from human feedback (RLHF) and of Gaussian-process (GP) based preference-based optimization (PBO) by proposing Bayesian RLHF, a hybrid framework that adds Laplace-based uncertainty estimates to the reward model and uses an acquisition-driven strategy to actively select informative human preference queries. The approach retains the scalability of neural models while improving sample efficiency through two components: a last-layer Laplace approximation and a mixed dueling Thompson sampling acquisition whose mixing is controlled by a parameter $\alpha$. Empirical results on high-dimensional numerical optimization and LLM fine-tuning show faster convergence and higher final accuracy under limited annotation budgets, with greater gains as the budget grows. The findings highlight the practical potential of uncertainty-aware, acquisition-guided human-in-the-loop learning for complex, real-world tasks.
Abstract
Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet collecting such preference data is often costly and time-consuming, which motivates more efficient learning paradigms. Two established approaches offer complementary advantages: Reinforcement Learning from Human Feedback (RLHF) scales effectively to high-dimensional tasks such as large language model (LLM) fine-tuning, while Preference-Based Optimization (PBO) achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency by integrating an acquisition-driven module into the RLHF pipeline, thereby enabling active and sample-efficient preference gathering. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.
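To make the two ingredients named in the TL;DR more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a reward of the form $r(x) = w^\top \phi(x)$ with fixed features $\phi$ (standing in for a frozen network body), a Bradley-Terry preference likelihood, and an isotropic Gaussian prior on $w$. The function names `laplace_fit` and `select_duel`, the parameter `prior_precision`, and in particular the way `alpha` mixes a second Thompson draw with an uncertainty-seeking choice are illustrative assumptions; this summary does not specify the paper's exact acquisition rule.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def laplace_fit(diffs, prior_precision=1.0, n_newton=25):
    """Last-layer Laplace approximation for a Bradley-Terry reward head.

    diffs[i] = phi(winner_i) - phi(loser_i), so each row encodes one
    observed preference. Returns the MAP weights and the Gaussian
    posterior covariance (inverse Hessian of the negative log posterior).
    """
    dim = diffs.shape[1]
    w = np.zeros(dim)

    def neg_hessian(w):
        p = sigmoid(diffs @ w)
        curvature = p * (1.0 - p)
        return (diffs * curvature[:, None]).T @ diffs + prior_precision * np.eye(dim)

    for _ in range(n_newton):
        p = sigmoid(diffs @ w)
        # Gradient of the log posterior: likelihood term minus prior pull.
        grad = diffs.T @ (1.0 - p) - prior_precision * w
        w = w + np.linalg.solve(neg_hessian(w), grad)

    cov = np.linalg.inv(neg_hessian(w))
    return w, 0.5 * (cov + cov.T)  # symmetrize against round-off


def select_duel(features, w_map, cov, alpha, rng):
    """Pick a pair of candidates to show the annotator.

    ASSUMPTION (illustrative mixing rule, not taken from the paper):
    the first candidate maximizes a Thompson-sampled reward; with
    probability alpha the second maximizes an independent Thompson
    sample, otherwise it is the candidate whose comparison against the
    first has the largest posterior variance.
    """
    w1 = rng.multivariate_normal(w_map, cov)
    first = int(np.argmax(features @ w1))

    if rng.random() < alpha:
        w2 = rng.multivariate_normal(w_map, cov)
        scores = features @ w2
    else:
        d = features - features[first]
        scores = np.einsum("ij,jk,ik->i", d, cov, d)

    scores[first] = -np.inf  # never duel a candidate against itself
    second = int(np.argmax(scores))
    return first, second
```

In an active loop under these assumptions, one would call `select_duel`, ask the annotator which of the two candidates is preferred, append the corresponding feature difference to `diffs`, and refit with `laplace_fit` before the next query.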
