Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems

Xiang Ji, Huazheng Wang, Minshuo Chen, Tuo Zhao, Mengdi Wang

TL;DR

The paper is motivated by the need to engineer or learn reward functions from human feedback in real-world decision problems. It theoretically compares rating-based and preference-based approaches in offline contextual bandits and introduces a monotone rating transformation to model human bias and uncertainty. It shows that rating-based methods can incur constant suboptimality under partial coverage and so may lose fast-convergence guarantees, while preference-based methods under the Bradley-Terry-Luce (BTL) model can achieve faster suboptimality decay under mild noise and coverage conditions. When rating and preference biases are matched, however, these advantages may vanish. The results suggest that the empirical success of preference-based methods may stem from milder annotator bias and uncertainty rather than from an inherent methodological superiority.

Abstract

For a real-world decision-making problem, the reward function often needs to be engineered or learned. A popular approach is to utilize human feedback to learn a reward function for training. The most straightforward way to do so is to ask humans to provide ratings for state-action pairs on an absolute scale and take these ratings as reward samples directly. Another popular way is to ask humans to rank a small set of state-action pairs by preference and learn a reward function from these preference data. Recently, preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT. In this work, we develop a theoretical comparison between these human feedback approaches in offline contextual bandits and show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches. Through this, our results seek to provide a theoretical explanation for the empirical successes of preference-based methods from a modeling perspective.
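
To make the two feedback types concrete, the following is a minimal Python sketch, not the paper's construction. It simulates a rating as a monotone transformation of the true reward plus noise, and a pairwise preference drawn from the Bradley-Terry-Luce (BTL) model. The `tanh` transformation, the Gaussian noise, the reward values, and all function names are illustrative assumptions.

```python
import math
import random


def btl_preference_prob(reward_a: float, reward_b: float) -> float:
    """Probability that item a is preferred over item b under the BTL model:
    a logistic function of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def sample_preference(reward_a: float, reward_b: float, rng: random.Random) -> int:
    """Return 1 if the simulated annotator prefers a, otherwise 0."""
    return 1 if rng.random() < btl_preference_prob(reward_a, reward_b) else 0


def biased_rating(reward: float, rng: random.Random, noise_std: float = 0.1) -> float:
    """Hypothetical rating model: a monotone transformation of the true reward
    plus Gaussian noise. tanh is an illustrative choice, not taken from the paper."""
    transformed = math.tanh(2.0 * reward)  # assumed monotone bias
    return transformed + rng.gauss(0.0, noise_std)


# Usage: two state-action pairs with assumed true rewards.
rng = random.Random(0)
true_rewards = {"a": 0.9, "b": 0.4}

ratings = {k: biased_rating(r, rng) for k, r in true_rewards.items()}
num_comparisons = 10_000
wins_for_a = sum(
    sample_preference(true_rewards["a"], true_rewards["b"], rng)
    for _ in range(num_comparisons)
)
empirical_pref = wins_for_a / num_comparisons
```

In this sketch the rating distorts the reward scale while preserving order, whereas the preference depends only on the reward difference; the paper's analysis concerns how such modeling choices affect the guarantees of each approach.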

Paper Structure

This paper contains 29 sections, 17 theorems, 74 equations, and 2 algorithms.

Key Result

Theorem 1

For any fixed constant $0 < \delta < 1$, there exists a contextual bandit instance with initial state distribution $\rho$ such that if one samples a dataset $\mathcal{D}$ of size $n \ge c(\delta,c_b,c_V,q,\sigma,R)$ using a sampling distribution $d$ satisfying Assumption [Cstar] with $C^\star$ [...], where $c_0$ is a universal constant and $c(\delta,c_b,c_V,q,\sigma,R)$ is a constant depending on [...]

Theorems & Definitions (20)

  • Remark 1
  • Remark 2
  • Remark 3
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Theorem 3
  • Corollary 3
  • Theorem 4
  • ...and 10 more