Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami

Abstract

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, where $M$ is the number of sources. The algorithm exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small, while remaining robust, with an unavoidable additive dependence on $\omega$, when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable. The resulting regret guarantees quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
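To make the three rates in the abstract concrete, the following is a minimal numerical sketch that evaluates them with all constants and logarithmic factors dropped. The function names (`unified_rate`, `lower_bound_rate`, `naive_rate`, `crossover_omega`) are hypothetical helpers introduced here for illustration; they are not from the paper and do not implement its algorithm.

```python
import math


def unified_rate(K: int, M: int, omega: float) -> float:
    """Upper-bound rate sqrt(K/M) + omega (constants and logs dropped)."""
    return math.sqrt(K / M) + omega


def lower_bound_rate(K: int, M: int, omega: float) -> float:
    """Lower-bound rate max{sqrt(K/M), omega} (constants and logs dropped)."""
    return max(math.sqrt(K / M), omega)


def naive_rate(K: int, omega: float) -> float:
    """Rate min{omega * sqrt(K), K} from the counterexample for naive use
    of imperfect feedback (constants and logs dropped)."""
    return min(omega * math.sqrt(K), K)


def crossover_omega(K: int, M: int) -> float:
    """Imperfection level at which the two terms of the unified rate are equal."""
    return math.sqrt(K / M)


if __name__ == "__main__":
    K, M = 10_000, 4  # illustrative values, chosen only for this example
    for omega in (1.0, 10.0, 50.0, 200.0):
        print(
            f"omega={omega:6.1f}  "
            f"unified={unified_rate(K, M, omega):8.1f}  "
            f"lower={lower_bound_rate(K, M, omega):8.1f}  "
            f"naive={naive_rate(K, omega):8.1f}"
        )
```

With these illustrative values, the statistical term $\sqrt{K/M}$ dominates while $\omega$ is below `crossover_omega(K, M)`, and the additive $\omega$ term dominates above it. The naive rate grows multiplicatively in $\omega$ until it saturates at $K$.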

Paper Structure (38 sections, 20 theorems, 80 equations, 1 algorithm)

Key Result

Theorem 2

Under the setting in Section sec:problemformulation, for any (possibly randomized) algorithm $\text{Alg}$ there exists an instance satisfying the cumulative imperfection budget (eq:def-uncertainty-budget) such that the expected regret over $K$ episodes is $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, where the expectation is over the randomness of the environment, algorithm, and feedback.
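As a brief worked note, the lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$ and the upper bound $\tilde{O}(\sqrt{K/M}+\omega)$ stated in the abstract agree up to constant and logarithmic factors. This follows from an elementary inequality, sketched below with the shorthand $a$ and $b$, which is introduced here and is not the paper's notation.

```latex
\[
  \max\{a,b\} \;\le\; a+b \;\le\; 2\max\{a,b\}
  \qquad \text{for all } a,b \ge 0 .
\]
Taking $a=\sqrt{K/M}$ and $b=\omega$ gives
\[
  \max\bigl\{\sqrt{K/M},\,\omega\bigr\}
  \;\le\; \sqrt{K/M}+\omega
  \;\le\; 2\max\bigl\{\sqrt{K/M},\,\omega\bigr\},
\]
so the two rates differ by at most a factor of $2$ before logarithmic terms.
The two regimes meet where the terms are equal, i.e. at
$\omega=\sqrt{K/M}$: below this level the $\sqrt{K/M}$ term dominates,
and above it the $\omega$ term dominates.
```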

Theorems & Definitions (25)

  • Remark 1: Identifiability and learnable object
  • Example 1: LLM RLHF: annotator/reward-model mismatch
  • Example 2: Autonomous driving: heterogeneous criteria
  • Theorem 2: Lower bound
  • Proposition 3: A counterexample
  • Theorem 4: Known transition kernel
  • Theorem 5: Unknown transition kernel
  • Definition 6: $\epsilon$-covering number
  • Definition 7: Eluder dimension
  • Theorem 8: Unknown transitions with general function approximation
  • ...and 15 more