On the Exponential Convergence for Offline RLHF with Pairwise Comparisons

Zhirui Chen, Vincent Y. F. Tan

TL;DR

The paper addresses offline RLHF with pairwise comparisons under a linear reward model and introduces RL-LOW, which achieves exponential simple regret decay with rate exp(-Ω(n/H(v))) where H(v) encodes instance hardness via suboptimality gaps. It proves a matching instance-dependent lower bound, establishing exponential-rate optimality, and extends RL-LOW to (ε,δ)-DP with label privacy while preserving the exponential rate asymptotically. The methods also yield a known-transitions MDP extension (RL-LOW-MDP) and a DP variant with explicit DP-dependent bounds, along with a worst-case O(n^{-1/2}) bound as a supplementary result. Together, the work fills a gap in the theory of offline RLHF by providing instance-dependent, exponential convergence guarantees rather than worst-case polynomial rates, with practical implications for efficient, privacy-preserving policy identification from offline human feedback.
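To make the contrast between the two rates concrete, the following minimal sketch evaluates exp(-n/H) against n^{-1/2} for a few sample sizes. The hardness value `H = 50`, the unit constants in both expressions, and the function names are illustrative assumptions; the paper's bounds hide constants inside Ω and O that are not reproduced here.

```python
import math


def exponential_rate(n: int, hardness: float) -> float:
    """exp(-n / H), with the constant hidden by Omega set to 1 (assumption)."""
    return math.exp(-n / hardness)


def polynomial_rate(n: int) -> float:
    """n^{-1/2}, with the constant hidden by O set to 1 (assumption)."""
    return 1.0 / math.sqrt(n)


if __name__ == "__main__":
    H = 50.0  # hypothetical hardness value, chosen only for illustration
    for n in (100, 200, 400, 800):
        print(f"n={n:4d}  exp(-n/H)={exponential_rate(n, H):.3e}  "
              f"n^(-1/2)={polynomial_rate(n):.3e}")
```

Under these assumed constants, the polynomial expression can be the smaller one for small n, but the exponential expression falls below it once n is large relative to H, which is the regime an instance-dependent bound describes.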

Abstract

We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to ascertain the optimal action for each state, with the ultimate goal of minimizing the simple regret. We propose an algorithm, RL with Locally Optimal Weights (RL-LOW), which yields a simple regret of the exponential form $\exp(-\Omega(n/H))$, where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound in offline RLHF with pairwise comparisons. Interestingly, we observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of RL-LOW. In view of privacy considerations in practical applications, we also extend RL-LOW to the setting of $(\varepsilon,\delta)$-differential privacy and show, somewhat surprisingly, that the hardness parameter $H$ is unchanged in the asymptotic regime as $n$ tends to infinity; this underscores the inherent efficiency of RL-LOW in terms of preserving the privacy of the observed rewards. Given our focus on establishing instance-dependent bounds of exponential convergence, our work fills a gap left by existing studies, which concentrate on establishing worst-case regrets of inverse polynomial convergence (e.g., $\widetilde{O}(1/\sqrt{n})$) for offline RLHF with pairwise comparisons.
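As a rough illustration of the quantity being minimized, the sketch below computes a simple regret for a toy instance with a linear reward. It assumes the simple regret is the expected gap, under a state distribution `rho`, between the reward of the optimal action and the reward of the chosen action; the feature map `phi`, the parameter `theta`, the state and action names, and the function names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

Features = Tuple[float, ...]


def linear_reward(theta: Features, feature: Features) -> float:
    """Reward as an inner product of the parameter and the feature vector."""
    return sum(t * f for t, f in zip(theta, feature))


def simple_regret(
    rho: Dict[str, float],
    phi: Dict[str, Dict[str, Features]],
    theta: Features,
    policy: Dict[str, str],
) -> float:
    """Expected reward gap between the best action and the policy's action."""
    total = 0.0
    for state, prob in rho.items():
        rewards = {a: linear_reward(theta, f) for a, f in phi[state].items()}
        total += prob * (max(rewards.values()) - rewards[policy[state]])
    return total


# Toy instance (all values are illustrative assumptions).
rho = {"s0": 0.5, "s1": 0.5}
phi = {
    "s0": {"a0": (1.0, 0.0), "a1": (0.0, 1.0)},
    "s1": {"a0": (1.0, 1.0), "a1": (0.0, 0.0)},
}
theta = (1.0, 0.5)

optimal_policy = {"s0": "a0", "s1": "a0"}
mistaken_policy = {"s0": "a1", "s1": "a0"}  # wrong only in state s0
```

In this toy instance the mistaken policy loses a reward gap of 0.5 in state `s0`, which is visited with probability 0.5, so its simple regret is 0.25, while the optimal policy has zero simple regret.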

Paper Structure

This paper contains 29 sections, 24 theorems, 161 equations, 1 figure, 1 algorithm.

Key Result

Proposition 2.3

(Impossibility Result) For any inconsistent instance $v=(\rho,\mathcal{S},\mathcal{A},\phi,N, \theta)$, there exists an instance $v'=(\rho,\mathcal{S},\mathcal{A},\phi,N, \theta')$ such that for all algorithms $\Pi$

Figures (1)

  • Figure 1: Comparison of RL-LOW and DP-RL-LOW to Pessimistic MLE on average simple regret and standard deviation (shaded area). In the left figure, we set $\delta=0.2$ and $\varepsilon=0.9$ for DP-RL-LOW. In the right figure, we set $n=400$ for all policies.

Theorems & Definitions (47)

  • Definition 2.2
  • Proposition 2.3
  • Definition 3.1
  • Proposition 3.2
  • Theorem 3.3
  • Proposition 3.4
  • Lemma 4.1
  • Lemma 4.2
  • Theorem 4.3
  • Theorem 5.1
  • ...and 37 more