Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

Heyang Zhao, Chenlu Ye, Quanquan Gu, Tong Zhang

TL;DR

This work provides a sharp theoretical treatment of reverse-KL regularization in KL-regularized contextual bandits and RLHF, demonstrating that KL regularization can yield significantly faster learning than unregularized objectives. It introduces a two-stage mixed sampling framework that, under sufficient data coverage from a reference policy, achieves an additive dependence on the coverage coefficient and a sample complexity of $O(\max(\eta^2 D^2, \eta/\epsilon) \log N_\mathcal{R}(\epsilon/\delta))$; a matching lower bound shows an inherent $\Omega(\eta \log N_\mathcal{R}(\epsilon)/\epsilon)$ term for small $\epsilon$. The analysis hinges on a novel suboptimality decomposition via a KL-aware objective and a sharp Taylor-based argument, clarifying how reverse KL shapes exploration-exploitation in these settings. The RLHF extension shows that similar additive-coverage gains hold for online preference-based learning, including local-coverage variants, and positions the approach relative to offline and hybrid RLHF methods. Together, the results illuminate the distinct statistical benefits of KL-regularization and data coverage, guiding the design of more efficient RLHF algorithms.
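To make the two regimes of the stated bound concrete, the following is a minimal sketch that evaluates its order with all constants dropped. The function names are hypothetical, `log_cover` stands in for the logarithmic covering-number factor, and `D` is assumed to be the coverage-related quantity appearing in the bound; none of this is taken from the paper's code.

```python
def upper_bound_scale(eta, D, eps, log_cover):
    """Order of max(eta^2 D^2, eta/eps) * log-cover factor (constants dropped)."""
    return max(eta**2 * D**2, eta / eps) * log_cover


def lower_bound_scale(eta, eps, log_cover):
    """Order of the eta * log-cover / eps lower-bound term (constants dropped)."""
    return eta * log_cover / eps


def crossover_epsilon(eta, D):
    """Accuracy below which eta/eps exceeds eta^2 D^2, i.e. eps < 1/(eta D^2)."""
    return 1.0 / (eta * D**2)


if __name__ == "__main__":
    eta, D, log_cover = 1.0, 2.0, 1.0
    for eps in (0.5, 0.1, 0.01):
        print(eps, upper_bound_scale(eta, D, eps, log_cover),
              lower_bound_scale(eta, eps, log_cover))
```

Under these assumptions, once $\epsilon$ falls below $1/(\eta D^2)$ the coverage term is dominated and the upper and lower scales coincide, which is the sense in which the coverage dependence is additive rather than multiplicative.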

Abstract

Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique used to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization have been empirically demonstrated in various practical scenarios, current theoretical analysis of KL-regularized RLHF still obtains the same $\mathcal{O}(1/\epsilon^2)$ sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with KL-regularization and ones without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an $\mathcal{O}(1/\epsilon)$ sample complexity when $\epsilon$ is sufficiently small. We further explore the role of data coverage in contextual bandits and RLHF. While the coverage assumption is commonly employed in offline RLHF to link the samples from the reference policy to the optimal policy, often at the cost of a multiplicative dependence on the coverage coefficient, its impact on the sample complexity of online RLHF remains unclear. Previous theoretical analyses of online RLHF typically require explicit exploration and additional structural assumptions on the reward function class. In contrast, we show that with sufficient coverage from the reference policy, a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms.
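As an illustration of what "staying close to a reference policy" means, the sketch below works through a single context with finitely many actions. It assumes the commonly used objective form, expected reward minus $\eta^{-1}\,\mathrm{KL}(\pi \,\|\, \pi_0)$, with $\eta$ acting as an inverse regularization strength; that form and the function names are assumptions for illustration and are not stated in this excerpt. Under that assumption the maximizer is the Gibbs policy $\pi(a) \propto \pi_0(a)\exp(\eta R(a))$, and its objective value equals $\eta^{-1}\log Z$ for normalizer $Z$.

```python
import math


def gibbs_policy(pi0, rewards, eta):
    """Reweight the reference policy pi0 by exp(eta * reward) and normalize."""
    weights = [p * math.exp(eta * r) for p, r in zip(pi0, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def kl_regularized_value(pi, pi0, rewards, eta):
    """Expected reward minus KL(pi || pi0) / eta (assumed objective form)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi0) if p > 0)
    return expected_reward - kl / eta


if __name__ == "__main__":
    pi0 = [0.5, 0.3, 0.2]        # hypothetical reference policy
    rewards = [1.0, 0.0, 0.5]    # hypothetical rewards for one context
    eta = 2.0
    pi = gibbs_policy(pi0, rewards, eta)
    print(pi, kl_regularized_value(pi, pi0, rewards, eta))
```

A larger `eta` weakens the regularization and pushes the policy toward the greedy action, while `eta` near zero returns the reference policy.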

Paper Structure

This paper contains 32 sections, 22 theorems, 106 equations, 2 figures, 2 algorithms.

Key Result

Theorem 3.6

For any $\epsilon \in (0, 1/256)$, $\eta > 4$, and any algorithm $A$, there exists a KL-regularized contextual bandit problem with reward function class $\mathcal{R}$ and $O(N_\mathcal{R}(\epsilon))$ data coverage coefficient (as defined in Definition 3.5) such that $A$ requires

Figures (2)

  • Figure 1: Suboptimality gap for KL-regularized contextual bandits.
  • Figure 2: Suboptimality gap for reinforcement learning from preference feedback.

Theorems & Definitions (38)

  • Remark 3.1
  • Remark 3.2
  • Definition 3.3: $\epsilon$-cover and covering number
  • Definition 3.4: Policy Improvement Oracle
  • Definition 3.5: Data Coverage
  • Theorem 3.6
  • Remark 3.7
  • Theorem 3.8
  • Lemma 3.9
  • Definition 3.10: Global-Policy Coverage
  • ...and 28 more