General Exploratory Bonus for Optimistic Exploration in RLHF

Wendi Li, Changdae Oh, Sharon Li

TL;DR

This work analyzes exploratory bonuses in RLHF and shows that standard KL- and α-divergence-regularized formulations fail to realize optimism: they tend to reward regions already well covered by the reference model. It introduces the General Exploratory Bonus (GEB), a reference-dependent framework that offsets this divergence-induced bias, is proven to satisfy optimism for 0 ≤ α ≤ 1, and subsumes prior heuristics as special cases. The authors provide theoretical guarantees and practical algorithms, including reward reparameterization, that integrate into iterative online RLHF without extra sampling costs. Empirically, GEB improves alignment performance across divergences and backbones, promotes sampling in low-probability regions, and yields more diverse, semantically coherent outputs, supporting it as a principled and scalable approach to optimistic exploration in RLHF.

Abstract

Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.

Paper Structure

This paper contains 54 sections, 6 theorems, 37 equations, 5 figures, 8 tables, 1 algorithm.

Key Result

Lemma 3.1

Let $r_1 = \arg\min_r \mathcal{L}_{BT}(\mathcal{D},r)$ be a reward model trained with the vanilla BT loss, and let $r_2 = \arg\min_r [\mathcal{L}_{BT}(\mathcal{D},r) - \kappa \max_\pi \mathcal{J}_{\beta,\text{KL}}(\pi,r) ]$ be a reward model trained with an additional exploratory bonus. If the polic
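The lemma statement above is cut off, so the following is only a minimal sketch of the mechanism the abstract describes, not the paper's proof. It assumes the usual KL-regularized objective $\mathcal{J}_{\beta,\text{KL}}(\pi,r) = \mathbb{E}_\pi[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_\text{ref})$, whose maximum over $\pi$ has the well-known closed form $\beta \log \sum_y \pi_\text{ref}(y)\exp(r(y)/\beta)$. The three-action bandit, its probabilities, and all function names are hypothetical. The sketch checks numerically that the gradient of this bonus with respect to each reward equals the optimal policy, which is proportional to $\pi_\text{ref}$, so the bonus pushes rewards up most where the reference model already places mass.

```python
import math


def kl_optimal_value(ref_probs, rewards, beta):
    """Closed form of max_pi E_pi[r] - beta * KL(pi || pi_ref) on a bandit."""
    m = max(rewards)  # subtract the max for numerical stability
    z = sum(p * math.exp((r - m) / beta) for p, r in zip(ref_probs, rewards))
    return m + beta * math.log(z)


def kl_optimal_policy(ref_probs, rewards, beta):
    """Maximizer of the objective: pi(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    m = max(rewards)
    weights = [p * math.exp((r - m) / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def bonus_gradient(ref_probs, rewards, beta, eps=1e-6):
    """Central finite-difference gradient of the bonus w.r.t. each reward."""
    grads = []
    for i in range(len(rewards)):
        up = list(rewards)
        down = list(rewards)
        up[i] += eps
        down[i] -= eps
        diff = kl_optimal_value(ref_probs, up, beta) - kl_optimal_value(ref_probs, down, beta)
        grads.append(diff / (2 * eps))
    return grads


# Hypothetical bandit: action 2 is rarely sampled by the reference policy.
ref_probs = [0.70, 0.25, 0.05]
rewards = [0.0, 0.0, 0.0]
beta = 1.0

policy = kl_optimal_policy(ref_probs, rewards, beta)
grads = bonus_gradient(ref_probs, rewards, beta)

for action, (g, p) in enumerate(zip(grads, policy)):
    print(f"action {action}: d(bonus)/d(reward) = {g:.4f}, optimal policy = {p:.4f}")
```

Under these assumptions the rarely visited action receives the smallest push from the bonus, which is the opposite of what optimistic exploration asks for.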

Figures (5)

  • Figure 1: The upper part compares passive exploration and optimistic exploration. Optimistic exploration stimulates the trajectories $\tau$ of small $\pi_\text{ref}$ (seldom visited/uncertain), while passive exploration sticks to the high-$\pi_\text{ref}$ region, failing to approach global optima. The dashed line separates regions of high vs. low likelihood under the learning policy $\pi_{\theta}$. The lower part contrasts the effect of the exploration bonus term in optimistic reward modeling between prior works and our GEB. Prior works often emphasize rewards in frequently visited regions, which constrains exploration within certain areas. In contrast, our GEB amplifies rewards in seldom-visited regions, thereby encouraging further sampling in uncertain areas and successfully achieving optimistic exploration.
  • Figure 2: Comparison of $\log \pi_\textrm{ref}$ of sampled responses in the last iteration between the general exploratory bonuses and vanilla iterative DPO. GEB-$\pi$, GEB-$1/\pi$, and GEB-$\mathrm{arctanh}(\pi-1)$ correspond to $1+\alpha-\pi$, $1/\pi$, and $\mathrm{arctanh}(1-\pi)+\alpha$, respectively, as in Table \ref{tab:exploration-bonus}.
  • Figure 3: Experiments with different $\kappa$. From left to right, the three graphs are under KL divergence, Hellinger distance, and forward KL divergence. The p, f, and tanh in the legends correspond to $1+\alpha-\pi$, $1/\pi$, and $\mathrm{arctanh}(1-\pi)+\alpha$ in Table \ref{tab:exploration-bonus}, respectively.
  • Figure 4: Comparison of the bandit policy distributions trained with DPO (left) and GEB (right). The DPO policy collapses to a local optimum, while the GEB policy continues to explore and ultimately chooses the globally preferred action.
  • Figure 5: Initial reference bandit distribution ("ref") and the reward distribution. Because the most preferred action lies in a low-probability region, it is rarely visited under purely passive exploration.
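The point made in the Figure 5 caption can be made concrete with a short calculation. This is a sketch under a simplifying assumption that is not taken from the paper: responses are drawn independently from a fixed reference policy, with no policy updates in between. The probability `0.01` and the budget of `64` samples are hypothetical values chosen for illustration.

```python
def prob_never_sampled(ref_prob, num_samples):
    """Chance that an action with probability ref_prob is missed in every i.i.d. draw."""
    return (1.0 - ref_prob) ** num_samples


def expected_visits(ref_prob, num_samples):
    """Expected number of times the action is drawn."""
    return ref_prob * num_samples


# Hypothetical setting: the best action has 1% mass under the reference policy.
best_action_prob = 0.01
budget = 64

miss = prob_never_sampled(best_action_prob, budget)
visits = expected_visits(best_action_prob, budget)
print(f"P(best action never sampled) = {miss:.3f}, expected visits = {visits:.2f}")
```

With these hypothetical numbers the preferred action is missed entirely a bit more than half the time, so no preference data about it would be collected unless something raises its sampling probability.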

Theorems & Definitions (9)

  • Definition 3.1: Optimism condition for exploration bonus
  • Lemma 3.1: Optimism failure under KL-divergence.
  • Definition 3.2: $\alpha$-divergence class
  • Lemma 3.2: Optimism failure under $\alpha$-divergence.
  • Theorem 3.3: Optimism failure beyond $\alpha$-divergence.
  • Lemma 4.1
  • Theorem 4.2
  • Definition C.1
  • Theorem C.1