Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret

Ming Shi; Yingbin Liang; Ness B. Shroff; Ananthram Swami

Abstract

Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference labels are generated consistently from a single ground-truth objective. In practical RLHF systems, however, feedback is typically \emph{multi-source} (annotators, experts, reward models, heuristics) and can exhibit systematic, persistent mismatches due to subjectivity, expertise variation, and annotation/modeling artifacts. We study episodic RL from \emph{multi-source imperfect preferences} through a cumulative imperfection budget: for each source, the total deviation of its preference probabilities from an ideal oracle is at most $\omega$ over $K$ episodes. We propose a unified algorithm with regret $\tilde{O}(\sqrt{K/M}+\omega)$, which exhibits a best-of-both-regimes behavior: it achieves $M$-dependent statistical gains when imperfection is small (where $M$ is the number of sources), while remaining robust with unavoidable additive dependence on $\omega$ when imperfection is large. We complement this with a lower bound $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, which captures the best possible improvement with respect to $M$ and the unavoidable dependence on $\omega$, and a counterexample showing that naïvely treating imperfect feedback as oracle-consistent can incur regret as large as $\tilde{\Omega}(\min\{\omega\sqrt{K},K\})$. Technically, our approach involves imperfection-adaptive weighted comparison learning, value-targeted transition estimation to control hidden feedback-induced distribution shift, and sub-importance sampling to keep the weighted objectives analyzable, yielding regret guarantees that quantify when multi-source feedback provably improves RLHF and how cumulative imperfection fundamentally limits it.
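To make the three rates in the abstract concrete, the following is a minimal numerical sketch that evaluates them for a few budgets $\omega$. It drops all logarithmic factors, constants, and any dependence on problem parameters other than $K$, $M$, and $\omega$, so it shows only how the rates scale and is not the regret of any actual algorithm. The function names and the sample values of `K`, `M`, and `omega` are illustrative assumptions, not taken from the paper.

```python
import math


def upper_rate(K: int, M: int, omega: float) -> float:
    """Scaling of the stated upper bound: sqrt(K/M) + omega (logs and constants dropped)."""
    return math.sqrt(K / M) + omega


def lower_rate(K: int, M: int, omega: float) -> float:
    """Scaling of the stated lower bound: max(sqrt(K/M), omega) (logs and constants dropped)."""
    return max(math.sqrt(K / M), omega)


def naive_rate(K: int, omega: float) -> float:
    """Scaling of the counterexample for imperfection-agnostic learning: min(omega*sqrt(K), K)."""
    return min(omega * math.sqrt(K), K)


if __name__ == "__main__":
    K, M = 10_000, 4  # hypothetical episode count and number of sources
    print(f"{'omega':>8} {'upper':>10} {'lower':>10} {'naive':>10}")
    for omega in (0.0, 10.0, 50.0, 500.0):  # hypothetical imperfection budgets
        print(
            f"{omega:8.1f} {upper_rate(K, M, omega):10.1f} "
            f"{lower_rate(K, M, omega):10.1f} {naive_rate(K, omega):10.1f}"
        )
```

In this sketch the crossover between the two regimes sits at $\omega \approx \sqrt{K/M}$: below it the $\sqrt{K/M}$ term dominates and more sources help, while above it the additive $\omega$ term dominates. The naive rate grows as $\omega\sqrt{K}$ until it saturates at $K$, which is why ignoring imperfection can be much worse than the additive $\omega$ dependence.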

Paper Structure (38 sections, 20 theorems, 80 equations, 1 algorithm)

Introduction
Related Work
Problem Formulation
Multi-Source Imperfect Preference Feedback
Performance Metric
Algorithm Design
Theoretical Results
A Lower Bound
A Counterexample: Ignoring Imperfection Can Be Much Worse
Regret Upper Bounds and Best-of-Both-Regimes Guarantee
Upper Bounds Under Linear Function Approximation
Upper Bounds Under General Function Approximation
Unknown Imperfection Budget
Conclusion and Future Work
Detailed Design of the Weights
...and 23 more sections

Key Result

Theorem 2

Under the setting in the Problem Formulation section, for any (possibly randomized) algorithm $\text{Alg}$ there exists an instance with $M$ sources, each satisfying the cumulative imperfection budget $\omega$, such that the expected regret of $\text{Alg}$ over $K$ episodes is $\tilde{\Omega}(\max\{\sqrt{K/M},\omega\})$, where the expectation is over the randomness of the environment, algorithm, and feedback.
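As a short worked step, the lower bound of Theorem 2 and the upper bound stated in the abstract can be compared directly. The sketch below uses only the elementary inequality $\max\{a,b\} \le a+b \le 2\max\{a,b\}$ for nonnegative $a,b$, and it tracks only the dependence on $K$, $M$, and $\omega$; any other problem-dependent factors in the paper's bounds are not covered here.

```latex
\[
  \max\Bigl\{\sqrt{K/M},\ \omega\Bigr\}
  \;\le\;
  \sqrt{K/M} + \omega
  \;\le\;
  2\,\max\Bigl\{\sqrt{K/M},\ \omega\Bigr\}.
\]
```

Hence, in terms of $K$, $M$, and $\omega$, the rate $\sqrt{K/M}+\omega$ and the rate $\max\{\sqrt{K/M},\omega\}$ differ by at most a factor of $2$, so the upper and lower bounds agree up to constant and logarithmic factors in these quantities.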

Theorems & Definitions (25)

Remark 1: Identifiability and learnable object
Example 1: LLM RLHF: annotator/reward-model mismatch
Example 2: Autonomous driving: heterogeneous criteria
Theorem 2: Lower bound
Proposition 3: A counterexample
Theorem 4: Known transition kernel
Theorem 5: Unknown transition kernel
Definition 6: $\epsilon$-covering number
Definition 7: Eluder dimension
Theorem 8: Unknown transitions with general function approximation
...and 15 more
