Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling

Yuwei Cheng, Fan Yao, Xuefeng Liu, Haifeng Xu

TL;DR

The paper develops the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret. The framework used to analyze it can also be applied to obtain corruption-robust guarantees for other popular gradient-based dueling bandit algorithms.

Abstract

This paper studies Learning from Imperfect Human Feedback (LIHF), addressing the potential irrationality or imperfect perception that arises when learning from comparative human feedback. Building on evidence that human imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit under a restricted form of corruption: the corruption scale decays over time as $t^{\rho-1}$ for some "imperfection rate" $\rho \in [0, 1]$. With $T$ as the total number of iterations, we establish a regret lower bound of $\Omega(\max\{\sqrt{T}, T^{\rho}\})$ for LIHF, even when $\rho$ is known. For the same setting, we develop the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret $\tilde{\mathcal{O}}(\max\{\sqrt{T}, T^{\rho}\})$. Core to our analysis is a novel framework for analyzing gradient-based dueling bandit algorithms under corruption. We demonstrate its general applicability by showing how it can be easily applied to obtain corruption-robust guarantees for other popular gradient-based dueling bandit algorithms. Our theoretical results are validated by extensive experiments.
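
To see where the $T^{\rho}$ term in these bounds can come from, it helps to add up the corruption over all rounds. The following is only a heuristic sketch, not the paper's proof: it assumes the per-round corruption magnitude is at most $c\,t^{\rho-1}$ for a hypothetical constant $c > 0$ (the precise condition is the one in the paper's Definition 1), and it bounds the sum by an integral, which is valid because $t^{\rho-1}$ is non-increasing in $t$.

```latex
% Assumption (illustrative): corruption in round t is at most c * t^{rho-1}, c > 0.
\sum_{t=1}^{T} c\, t^{\rho-1}
  \;\le\; c\Bigl(1 + \int_{1}^{T} s^{\rho-1}\,ds\Bigr)
  \;=\; c\Bigl(1 + \frac{T^{\rho}-1}{\rho}\Bigr)
  \;\le\; \frac{c}{\rho}\,T^{\rho},
  \qquad \rho \in (0,1],
\\[4pt]
% Boundary case rho = 0: the sum is harmonic.
\sum_{t=1}^{T} c\, t^{-1} \;\le\; c\,(1 + \ln T),
\\[4pt]
% Which term dominates the regret rate.
\max\{\sqrt{T},\, T^{\rho}\} =
\begin{cases}
  \sqrt{T}, & \rho \le 1/2,\\
  T^{\rho}, & \rho > 1/2.
\end{cases}
```

Under this reading, the total corruption budget grows like $T^{\rho}$, so for $\rho \le 1/2$ the rate matches the usual $\sqrt{T}$ rate, and the imperfection only changes the rate once $\rho > 1/2$.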

Paper Structure (18 sections, 45 theorems, 83 equations, 3 figures, 3 algorithms)

Key Result

Theorem 1

There exist a $\rho$-Imperfect Human Feedback (see Definition 1), a strongly concave utility function $\mu$, and a link function $\sigma$ such that any learner must suffer $\text{Reg}_T \geq \Omega\left(d\max\{\sqrt{T}, T^{\rho}\}\right)$, even with knowledge of $\rho$.

Figures (3)

  • Figure 1: Robustness of RoSMID for $\rho$-LIHF
  • Figure 2: Each algorithm is tested under $\rho$-Imperfect Human Feedback with $\rho = 0.5, 0.6, 0.75$. For each $\rho$, a line plot shows the average regret over five simulations, with $\pm$ one standard deviation shown by the shaded region. In the legend, $o$ denotes the estimated line slope, calculated using least squares on the last $1\%$ of the data.
  • Figure 3: The first row uses $\alpha = 0.05$ for DBGD and $\alpha = 0.9$ for RoSMID. The second row uses $\alpha = 0.1$ for DBGD and $\alpha = 0.8$ for RoSMID. For each choice of $\alpha$, each algorithm is tested under $\rho$-Imperfect Human Feedback with $\rho = 0.5, 0.8, 0.95$. For each $\rho$, a line plot shows the average regret over five simulations, with $\pm$ one standard deviation shown by the shaded region. In the legend, $o$ denotes the estimated line slope, calculated using least squares on the last $1\%$ of the data.
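
The slope estimate $o$ described in the captions for Figures 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the line is fitted on log-log axes (so that a regret curve growing like $T^{\rho}$ has slope $\rho$), and the function name `tail_loglog_slope`, the parameter `tail_frac`, and the synthetic regret curve are all hypothetical.

```python
import numpy as np


def tail_loglog_slope(regret, tail_frac=0.01):
    """Least-squares slope of log(regret) against log(t) on the last
    `tail_frac` fraction of rounds (assumed log-log fit)."""
    regret = np.asarray(regret, dtype=float)
    num_rounds = regret.size
    # Index of the first round that belongs to the tail window.
    start = int(np.floor(num_rounds * (1.0 - tail_frac)))
    rounds = np.arange(1, num_rounds + 1)[start:]
    tail = regret[start:]
    # Degree-1 polynomial fit returns (slope, intercept).
    slope, _intercept = np.polyfit(np.log(rounds), np.log(tail), deg=1)
    return slope


# Synthetic stand-in for an averaged regret curve growing like T^0.75.
T = 100_000
t = np.arange(1, T + 1)
synthetic_regret = 3.0 * t ** 0.75
slope = tail_loglog_slope(synthetic_regret)
print(f"estimated slope: {slope:.4f}")
```

Fitting only the tail reduces the influence of early rounds, where lower-order terms can dominate the regret. On real, noisy curves the estimate will vary from run to run.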

Theorems & Definitions (47)

  • Definition 1: $\rho$-Imperfect Human Feedback
  • Theorem 1
  • Lemma 1: Lower Bound under Direct Reward Feedback
  • Proposition 1
  • Theorem 2
  • Lemma 2: Corrupted Gradients Estimation
  • Lemma 3: Regret Decomposition for Dueling Bandits
  • Lemma 4: Feedback Error
  • Lemma 5
  • Lemma 6: Induction Claim
  • ...and 37 more