Regularized Online RLHF with Generalized Bilinear Preferences

Junghyun Lee; Minju Hong; Kwang-Sung Jun; Chulhee Yun; Se-Young Yun

TL;DR

This work studies contextual online RLHF with general preferences and proves that the dual gap of the greedy policy is bounded by the square of the estimation error, a result that follows solely from the strong convexity of the regularizer and the skew-symmetry of the GBPM.

Abstract

We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM) to capture potentially intransitive preferences via low-rank, skew-symmetric matrices. We investigate general preference learning with any strongly convex regularizer (where $\eta^{-1}$ is the regularization strength), generalizing beyond prior works limited to reverse KL-regularization. Central to our analysis is proving that the dual gap of the greedy policy is bounded by the square of the estimation error, a result derived solely from strong convexity and the skew-symmetry of the GBPM. Building on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, $e^{O(\eta)}$-free regret $\tilde{O}(\eta d^4 (\log T)^2)$. (2) Explore-Then-Commit achieves $\mathrm{poly}(d)$-free regret $\tilde{O}(\sqrt{\eta r T})$ by exploiting the low-rank structure; this is the first statistically efficient guarantee for online RLHF in high dimensions.

Paper Structure (65 sections, 30 theorems, 147 equations, 3 tables)

This paper contains 65 sections, 30 theorems, 147 equations, 3 tables.

Introduction
RLHF Theory.
General Preference Learning.
Generalized Bilinear Preference Model (GBPM).
Contributions.
Notations.
Problem Setting
(Self-Play) Interaction Protocol
Interaction Protocol.
Feature Diversity.
Generalized Bilinear Preference Model (GBPM)
Regularized NE and Regret Definitions
Computation Oracles.
A New Analysis of Regularized Regret
Proof of Theorem 3.1
...and 50 more sections

Key Result

Lemma 2.4

For any $\bm\Theta \in \mathrm{Skew}(d)$, the value of the max-min game, $\max_{\pi^1} \min_{\pi^2} J_{\eta}(\pi^1, \pi^2; \bm\Theta)$, is always $\frac{1}{2}$.
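
The symmetry behind this value can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's construction: it assumes the preference probability has the form $\mu(\phi(a)^\top \bm\Theta \phi(b))$ with a logistic link $\mu$, and that the two players' regularization terms enter $J_\eta$ with opposite signs, so they cancel when both players use the same policy. The helper names `random_skew`, `preference_matrix`, and `self_play_value`, as well as the random features, are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic link (assumed here); satisfies sigmoid(z) + sigmoid(-z) = 1."""
    return 1.0 / (1.0 + math.exp(-z))


def random_skew(d, rng):
    """Draw a random d x d skew-symmetric matrix (theta[j][i] = -theta[i][j])."""
    theta = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            v = rng.uniform(-1.0, 1.0)
            theta[i][j] = v
            theta[j][i] = -v
    return theta


def bilinear(x, theta, y):
    """Compute x^T theta y."""
    return sum(x[i] * theta[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def preference_matrix(features, theta):
    """pref[a][b] = assumed probability that action a is preferred to action b."""
    return [[sigmoid(bilinear(fa, theta, fb)) for fb in features]
            for fa in features]


def self_play_value(policy, pref):
    """Expected win rate of a policy against itself (regularizers assumed to cancel)."""
    n = len(policy)
    return sum(policy[a] * pref[a][b] * policy[b]
               for a in range(n) for b in range(n))


rng = random.Random(0)
d, num_actions = 4, 6
theta = random_skew(d, rng)
features = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_actions)]
pref = preference_matrix(features, theta)

# An arbitrary policy over the actions (normalized random weights).
weights = [rng.random() for _ in range(num_actions)]
policy = [w / sum(weights) for w in weights]

value = self_play_value(policy, pref)
print(value)  # approximately 0.5, up to floating-point error
```

Under these assumptions, skew-symmetry gives $P(a \succ b) + P(b \succ a) = 1$ for every pair of actions, so any policy played against itself earns $\frac{1}{2}$. The min player can therefore always hold the max player to at most $\frac{1}{2}$ by copying its policy, which is one half of the argument for the lemma's value.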

Theorems & Definitions (56)

Remark 2.1: Scaling of $C_{\min}$
Definition 2.2: Generalized Bilinear Preference Model
Definition 2.3: Nash Equilibrium
Lemma 2.4
proof
Definition 2.5
Remark 2.6
Theorem 3.1
proof: Proof Sketch of Theorem 3.1
Lemma 3.2
...and 46 more
