Robust Reinforcement Learning from Corrupted Human Feedback

Alexander Bukharin, Ilgee Hong, Haoming Jiang, Zichong Li, Qingru Zhang, Zixuan Zhang, Tuo Zhao

TL;DR

This work tackles the vulnerability of reinforcement learning from human feedback (RLHF) to corrupted preference labels by proposing $R^3M$, a robust reward modeling framework that treats label corruption as sparse outliers in a perturbed Bradley–Terry model. It jointly learns the ground-truth reward $r^*$ and sparse perturbations via an $\ell_1$-regularized maximum likelihood objective, using an efficient alternating optimization with a closed-form update for the perturbations, followed by standard policy optimization (PPO). The authors prove a nonparametric, high-probability error bound showing that reward recovery remains accurate when the number of outliers grows sublinearly with the data, and they extend the method to direct preference optimization (DPO). Empirically, $R^3M$ improves robustness on robotic control and large language model benchmarks, and the experiments demonstrate effective outlier detection and improved performance under several corruption models.
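To make the summary concrete, here is one way the perturbed Bradley–Terry model and the closed-form perturbation update could be written. The notation ($\Delta_i$ for the reward margin of pair $i$, a nonnegative perturbation $\delta_i$, and a per-sample weight $\lambda \in (0,1)$) is an assumption made for illustration; the paper's exact formulation and normalization of $\lambda$ may differ.

$$
\begin{aligned}
&\text{Model:} && P(y_{w,i} \succ y_{l,i}) = \sigma(\Delta_i + \delta_i), \qquad \Delta_i = r(x_i, y_{w,i}) - r(x_i, y_{l,i}), \\
&\text{Objective:} && \min_{r,\ \delta \ge 0}\ \frac{1}{n} \sum_{i=1}^{n} \Big[ -\log \sigma(\Delta_i + \delta_i) + \lambda\, \delta_i \Big], \\
&\text{Stationarity in } \delta_i\text{:} && -\big(1 - \sigma(\Delta_i + \delta_i)\big) + \lambda = 0 \ \Longrightarrow\ \Delta_i + \delta_i = \log\frac{1-\lambda}{\lambda}, \\
&\text{Update:} && \delta_i = \max\!\Big(0,\ \log\frac{1-\lambda}{\lambda} - \Delta_i\Big).
\end{aligned}
$$

Because the per-sample objective is convex in $\delta_i$, clipping the stationary point at zero gives the constrained minimizer. Under this sketch, pairs whose margin already exceeds $\log\frac{1-\lambda}{\lambda}$ receive $\delta_i = 0$, while pairs the current reward disagrees with receive a positive perturbation, which is what makes the $\delta_i$ usable as outlier indicators.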

Abstract

Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, e.g., personal bias, context ambiguity, or lack of training, human annotators may give incorrect or inconsistent preference labels. To tackle this challenge, we propose a robust RLHF approach, $R^3M$, which models the potentially corrupted preference labels as sparse outliers. Accordingly, we formulate robust reward learning as an $\ell_1$-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which incurs only negligible computational overhead compared with the standard RLHF approach. Theoretically, we prove that under proper regularity conditions, $R^3M$ can consistently learn the underlying reward and identify outliers, provided that the number of outlier labels scales sublinearly with the preference sample size. Furthermore, we remark that $R^3M$ is versatile and can be extended to various preference optimization methods, including direct preference optimization (DPO). Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R^3M$ improves robustness of the reward against several types of perturbations to the preference data.
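The alternating scheme described in the abstract can be sketched on a toy problem as follows. This is a minimal illustration, not the authors' implementation: the linear reward model, the synthetic data, the helper names (`make_toy_data`, `update_delta`, `fit_robust_reward`), the hyperparameters, and the per-sample penalty $\lambda\,\delta_i$ with $\delta_i \ge 0$ and $\lambda \in (0,1)$ are all assumptions. The perturbation step uses the closed form $\delta_i = \max(0, \log\frac{1-\lambda}{\lambda} - \Delta_i)$, which follows from setting the derivative of $-\log\sigma(\Delta_i + \delta_i) + \lambda\delta_i$ to zero.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def make_toy_data(n=400, d=3, flip_frac=0.1, seed=0):
    """Hypothetical synthetic preferences: clean labels from a linear reward,
    then a random fraction of pairs has its label flipped (corrupted)."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    xa, xb = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    a_better = ((xa - xb) @ w_true > 0)[:, None]
    x_pref, x_rej = np.where(a_better, xa, xb), np.where(a_better, xb, xa)
    flipped = rng.random(n) < flip_frac
    f = flipped[:, None]
    # Swap preferred/rejected items for the corrupted pairs.
    return np.where(f, x_rej, x_pref), np.where(f, x_pref, x_rej), flipped


def update_delta(margin, lam):
    """Minimizer over delta >= 0 of -log sigmoid(margin + delta) + lam * delta."""
    threshold = np.log((1.0 - lam) / lam)
    return np.maximum(0.0, threshold - margin)


def fit_robust_reward(x_pref, x_rej, lam=0.2, lr=0.5, steps=500):
    """Alternate a closed-form delta update with a gradient step on a
    linear reward r(x) = w @ x (the linear model is an assumption)."""
    diff = x_pref - x_rej
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        margin = diff @ w
        delta = update_delta(margin, lam)  # perturbation step
        # Gradient of mean_i -log sigmoid(margin_i + delta_i) w.r.t. w.
        weight = 1.0 - sigmoid(margin + delta)
        w -= lr * (-(weight[:, None] * diff).mean(axis=0))  # reward step
    return w, update_delta(diff @ w, lam)


x_pref, x_rej, flipped = make_toy_data()
w_hat, delta_hat = fit_robust_reward(x_pref, x_rej)
print("mean delta on flipped pairs:", delta_hat[flipped].mean())
print("mean delta on clean pairs:  ", delta_hat[~flipped].mean())
```

In this sketch the perturbation step caps each pair's gradient weight at $\lambda$, so a mislabeled pair cannot pull the reward arbitrarily far, and the only extra work per iteration is an elementwise maximum. Pairs with a large learned $\delta_i$ are the natural candidates to inspect as corrupted labels.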

Paper Structure

This paper contains 23 sections, 6 theorems, 51 equations, 7 figures, and 4 tables.

Key Result

Theorem 4.4

Suppose Assumptions assump:sparse and assump:idt hold. Let $\widehat{R} = [\widehat{r}(s,a)]$ and $\widehat{\delta}$ be the minimizers of eq:constrained-obj. Given $\lambda = 1/n$, there exist universal constants $C_0>0$ and $\gamma$ such that we have with overwhelming probability.

Figures (7)

  • Figure 1: Learning curves and percentile plots for the baseline (cross-entropy loss) and $R^3M$ for the stochastic noise model.
  • Figure 2: Learning curves and percentile plots for the baseline (cross-entropy loss) and $R^3M$ for the myopic noise model.
  • Figure 3: Learning curves and percentile plots for the baseline (cross-entropy loss) and $R^3M$ for the irrational noise model.
  • Figure 4: Comparison of outlier ratios between sample pairs with zero and positive learned perturbation factors for $\tau=1.0$, $\gamma=0.3$, and $p=1/3$ for the stochastic, myopic, and irrational noise models, respectively.
  • Figure 5: (a) Comparison of the Claude 3 agreement on the annotated labels between sample pairs with zero and positive learned perturbation factors. (b) An example of corrupted annotation in the HH dataset.
  • ...and 2 more figures

Theorems & Definitions (11)

  • Remark 3.1
  • Remark 3.2
  • Remark 4.1
  • Theorem 4.4
  • Remark 4.5
  • Remark 4.6
  • Lemma 7.1
  • Lemma 7.2
  • Lemma 7.3
  • Lemma 7.4
  • ...and 1 more