Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling

Yuwei Cheng; Fan Yao; Xuefeng Liu; Haifeng Xu

TL;DR

The paper develops the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret. The analysis framework behind it can also be applied to obtain corruption-robust guarantees for other popular gradient-based dueling bandit algorithms.

Abstract

This paper studies Learning from Imperfect Human Feedback (LIHF), addressing potential irrationality or imperfect perception when learning from comparative human feedback. Building on evidence that human imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit under a restricted form of corruption: the corruption scale decays over time as $t^{\rho-1}$ for some "imperfection rate" $\rho \in [0, 1]$. With $T$ as the total number of iterations, we establish a regret lower bound of $\Omega(\max\{\sqrt{T}, T^{\rho}\})$ for LIHF, even when $\rho$ is known. For the same setting, we develop the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret $\tilde{\mathcal{O}}(\max\{\sqrt{T}, T^{\rho}\})$. Core to our analysis is a novel framework for analyzing gradient-based algorithms for dueling bandits under corruption, and we demonstrate its general applicability by showing how this framework can be easily applied to obtain corruption-robust guarantees for other popular gradient-based dueling bandit algorithms. Our theoretical results are validated by extensive experiments.
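One way to build intuition for the $T^{\rho}$ term is to add up the per-round corruption scale $t^{\rho-1}$ over $T$ rounds: for $\rho \in (0, 1]$ the total grows like $T^{\rho}/\rho$. The sketch below checks this numerically. It is an illustration only, not part of the paper: the function name `cumulative_corruption` is hypothetical, and it assumes the per-round scale is exactly $t^{\rho-1}$ with unit constant.

```python
def cumulative_corruption(T: int, rho: float) -> float:
    """Sum of the assumed per-round corruption scale t**(rho - 1) for t = 1..T."""
    return sum(t ** (rho - 1.0) for t in range(1, T + 1))


def integral_bounds(T: int, rho: float) -> tuple[float, float]:
    """Integral-comparison bounds on the sum, valid for 0 < rho <= 1.

    lower: integral of x**(rho - 1) over [1, T + 1]
    upper: T**rho / rho
    """
    lower = ((T + 1) ** rho - 1.0) / rho
    upper = T ** rho / rho
    return lower, upper


if __name__ == "__main__":
    T = 10_000
    for rho in (0.25, 0.5, 0.75):
        total = cumulative_corruption(T, rho)
        lower, upper = integral_bounds(T, rho)
        print(f"rho={rho}: lower={lower:.2f} total={total:.2f} upper={upper:.2f}")
```

The case $\rho = 0$ is left out of the sketch because the sum then grows only logarithmically in $T$ and the $T^{\rho}/\rho$ expression does not apply.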

Paper Structure (18 sections, 45 theorems, 83 equations, 3 figures, 3 algorithms)

Introduction
The Problem of Learning from Imperfect Human Feedback
The Intrinsic Limit of LIHF
An Efficient and Tight $\rho$-LIHF Algorithm
Additional Applications of the Above Regret Analysis Framework
Experiments
Conclusion
Proofs for Theorem \ref{theorem:regret_lower_bound}
Proof for Lemma \ref{lemma:lower bound}
Proof for Proposition \ref{prop:regret_lower_bound_linear}
Proof for Theorem \ref{proposition:matching_upper_bound}
Proof for Proposition \ref{theorem:regret_upper_bound_unknown}
Proof for Robustness Statement in Proposition \ref{theorem:regret_upper_bound_unknown}
Proof for Efficiency Statement in Proposition \ref{theorem:regret_upper_bound_unknown}
Proof for Proposition \ref{theorem:regret_upper_bound_DBGD}
...and 3 more sections

Key Result

Theorem 1

There exist a $\rho$-Imperfect Human Feedback (see Definition 1), a strongly concave utility function $\mu$, and a link function $\sigma$ such that any learner must suffer $\text{Reg}_T \geq \Omega\left(d\max\{\sqrt{T}, T^{\rho}\}\right)$, even with knowledge of $\rho$.
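To read the bound, it may help to split the maximum into its two regimes. The display below is a plain restatement of $\max\{\sqrt{T}, T^{\rho}\}$ for $T \geq 1$, with constants and the factor $d$ left aside:

```latex
\max\{\sqrt{T},\, T^{\rho}\} =
\begin{cases}
  \sqrt{T},   & 0 \le \rho \le \tfrac{1}{2}, \\
  T^{\rho},   & \tfrac{1}{2} < \rho \le 1.
\end{cases}
```

So the lower bound stays at the $\sqrt{T}$ rate whenever $\rho \le 1/2$, grows polynomially faster once $\rho > 1/2$, and becomes linear in $T$ at $\rho = 1$.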

Figures (3)

Figure 1: Robustness of RoSMID for $\rho$-LIHF
Figure 2: For each algorithm, we tested its performance under $\rho$-Imperfect Human Feedback with $\rho = 0.5, 0.6, 0.75$. For each $\rho$, we present a line plot of the average regret over five simulations, with $\pm$ one standard deviation shown by the shaded region. In the legend, $o$ denotes the estimated line slope, calculated using least squares on the last $1\%$ of the data.
Figure 3: In the first row, we consider $\alpha = 0.05$ for DBGD and $\alpha = 0.9$ for RoSMID. In the second row, we consider $\alpha = 0.1$ for DBGD and $\alpha = 0.8$ for RoSMID. For each $\alpha$ and each algorithm, we tested its performance under $\rho$-Imperfect Human Feedback with $\rho = 0.5, 0.8, 0.95$. For each $\rho$, we present a line plot of the average regret over five simulations, with $\pm$ one standard deviation shown by the shaded region. In the legend, $o$ denotes the estimated line slope, calculated using least squares on the last $1\%$ of the data.
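The slope estimate mentioned in the captions of Figures 2 and 3 can be sketched as an ordinary least-squares fit on the tail of a curve. This is a minimal illustration, not the authors' code: the function name `tail_slope` is hypothetical, and the use of log-log coordinates in the demo is an assumption (the captions do not state the axis scale). On log-log axes, a regret curve growing like $T^{\rho}$ would give a slope close to $\rho$.

```python
import math


def tail_slope(xs, ys, frac=0.01):
    """Ordinary least-squares slope fitted on the last `frac` of the points.

    At least two points are always used so the slope is well defined.
    """
    n = len(xs)
    k = max(2, int(n * frac))
    tx, ty = xs[-k:], ys[-k:]
    mean_x = sum(tx) / k
    mean_y = sum(ty) / k
    sxx = sum((x - mean_x) ** 2 for x in tx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tx, ty))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic regret curve t**0.75 (an assumption for the demo, not paper data).
    T = 10_000
    log_t = [math.log(t) for t in range(1, T + 1)]
    log_regret = [0.75 * lt for lt in log_t]
    print(f"estimated slope: {tail_slope(log_t, log_regret):.3f}")
```

Fitting only the tail discards the early rounds, where lower-order terms can dominate, so the estimate reflects the asymptotic growth rate more closely than a fit over the whole curve would.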

Theorems & Definitions (47)

Definition 1: $\rho$-Imperfect Human Feedback
Theorem 1
Lemma 1: Lower Bound under Direct Reward Feedback
Proposition 1
Theorem 2
Lemma 2: Corrupted Gradients Estimation
Lemma 3: Regret Decomposition for Dueling Bandits
Lemma 4: Feedback Error
Lemma 5
Lemma 6: Induction Claim
...and 37 more
