Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference

Matteo Cercola, Valeria Capretti, Simone Formentin

TL;DR

This work addresses the data inefficiency of RLHF and GP-based PBO by proposing Bayesian RLHF, a hybrid framework that injects Laplace-based uncertainty into the reward model and uses an acquisition-driven strategy to actively select informative human preferences. The approach retains the scalability of neural models while achieving better sample efficiency through a last-layer Laplace approximation and a mixed dueling Thompson sampling acquisition (controlled by $\alpha$). Empirical results in high-dimensional numerical optimization and LLM fine-tuning show faster convergence and higher final accuracy under limited annotation budgets, with greater gains as budget grows. The findings highlight the practical potential of uncertainty-aware, acquisition-guided human-in-the-loop learning for complex, real-world tasks.

Abstract

Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet, collecting such preference data is often costly and time-consuming, motivating the need for more efficient learning paradigms. Two established approaches offer complementary advantages: RLHF scales effectively to high-dimensional tasks such as LLM fine-tuning, while PBO achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency by integrating an acquisition-driven module into the RLHF pipeline, thereby enabling active and sample-efficient preference gathering. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.
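
To make the two ingredients named in the TL;DR more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear Bradley-Terry reward head on top of fixed features (standing in for the last layer of a neural reward model), a Gaussian prior with a hypothetical `prior_precision`, and one possible reading of the $\alpha$ mixing: with probability $\alpha$ the challenger is chosen by posterior uncertainty, otherwise by a second Thompson draw. The function names `fit_laplace_head` and `select_duel` are illustrative, and the exact role of $\alpha$ in the paper may differ.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_laplace_head(diffs, labels, prior_precision=1.0, n_iter=50):
    """MAP fit and Laplace posterior for a linear Bradley-Terry reward head.

    diffs[k]  = phi(a_k) - phi(b_k), the feature difference of the k-th duel
    labels[k] = 1.0 if a_k was preferred by the annotator, else 0.0
    Returns the MAP weights and the Gaussian posterior covariance.
    """
    d = diffs.shape[1]
    w = np.zeros(d)

    def hessian(w):
        p = sigmoid(diffs @ w)
        # Hessian of the negative log-posterior (logistic loss + Gaussian prior).
        return (diffs * (p * (1.0 - p))[:, None]).T @ diffs + prior_precision * np.eye(d)

    for _ in range(n_iter):  # Newton steps on the regularised log-loss
        p = sigmoid(diffs @ w)
        grad = diffs.T @ (p - labels) + prior_precision * w
        w = w - np.linalg.solve(hessian(w), grad)

    cov = np.linalg.inv(hessian(w))  # Laplace: covariance = inverse Hessian at the MAP
    cov = 0.5 * (cov + cov.T)        # symmetrise against round-off
    return w, cov


def select_duel(features, w_map, cov, alpha, rng):
    """Pick a pair (i, j) of candidate indices to show to the annotator.

    ASSUMPTION: alpha is the probability of an uncertainty-driven challenger;
    this is an illustrative mixing rule, not taken from the paper.
    """
    # Incumbent: best candidate under one posterior sample (Thompson sampling).
    w1 = rng.multivariate_normal(w_map, cov)
    i = int(np.argmax(features @ w1))

    if rng.random() < alpha:
        # Exploration: challenger whose reward gap to i is most uncertain.
        gaps = features - features[i]
        scores = np.einsum("nd,de,ne->n", gaps, cov, gaps)
    else:
        # Exploitation: challenger that is best under a second posterior sample.
        w2 = rng.multivariate_normal(w_map, cov)
        scores = features @ w2

    scores[i] = -np.inf  # never duel a candidate against itself
    j = int(np.argmax(scores))
    return i, j
```

In an active-learning loop under these assumptions, one would call `select_duel` to choose the next pair, query the annotator, append the new feature difference and label, and refit with `fit_laplace_head`. Setting `alpha` closer to 1 spends more queries on uncertain comparisons, while values near 0 concentrate on candidates the posterior already rates highly.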

Paper Structure

This paper contains 13 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the proposed Bayesian RLHF framework, integrating Laplace-based uncertainty estimation in the reward model and an acquisition function for efficient preference querying. Novel components relative to standard RLHF are highlighted in red.
  • Figure 2: Comparison between Bayesian RLHF (B-RLHF) in blue and baseline PBO in orange on the Rosenbrock problem. Solid lines indicate the mean response, and the shaded bands represent $\pm$ one standard deviation over 5 Monte Carlo runs.
  • Figure 3: Best value of the latent function on the Rosenbrock optimization problem, achieved by our algorithm Bayesian RLHF (B-RLHF). Solid lines indicate the mean response, and the shaded bands represent $\pm$ one standard deviation over 3 Monte Carlo runs.
  • Figure 4: Mean and standard deviation of the final optimization error across 3 independent runs for B-RLHF and PBO on the 10D Rosenbrock function with a budget of 4000 queries and a 10-hour time limit.
  • Figure 5: Sensitivity analysis of the $\alpha$ exploration–exploitation parameter, averaged over 38 Monte Carlo runs.
  • ...and 1 more figure