RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment

Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu

TL;DR

This work tackles the fragility of LLM alignment methods in the presence of noisy human preferences. It introduces Robust Enhanced Policy Optimization (RE-PO), an EM-based framework that treats each preference label as a latent variable and jointly infers per-label confidences and per-annotator reliabilities to reweight supervision, thereby yielding robust alignment across multiple loss formulations. The authors establish a theoretical consistency result under a perfectly calibrated model and demonstrate substantial empirical gains (up to 7.0 percentage points in AlpacaEval 2) across four direct preference objectives (DPO, IPO, SimPO, CPO) and two base models, including multi-annotator data (MultiPref). They also show that RE-PO serves as a versatile, general-purpose robustness layer with qualitative evidence that it down-weights noisy labels, offering practical benefits for real-world preference data.

Abstract

Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone for aligning large language models (LLMs) with human values. However, these methods typically assume that preference data is clean and that all labels are equally reliable. In practice, large-scale preference datasets contain substantial noise due to annotator mistakes, inconsistent instructions, varying expertise, and even adversarial or low-effort feedback. This mismatch between recorded labels and ground-truth preferences can misguide training and degrade model performance. To address this issue, we introduce Robust Enhanced Policy Optimization (RE-PO), which uses an expectation-maximization procedure to infer the posterior correctness of each label and then adaptively reweight data points in the training loss to mitigate label noise. We further generalize this idea by establishing a theoretical link between arbitrary preference losses and their underlying probabilistic models, enabling a systematic transformation of existing alignment algorithms into robust counterparts and elevating RE-PO from a single method to a general framework for robust preference alignment. Theoretically, we prove that, under a perfectly calibrated model, RE-PO recovers the true noise level of the dataset. Empirically, we show that RE-PO consistently improves four state-of-the-art alignment methods (DPO, IPO, SimPO, and CPO); when applied to Mistral and Llama 3 models, the RE-PO-enhanced variants increase AlpacaEval 2 win rates by up to 7.0 percentage points over their respective baselines.
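
The expectation-maximization idea described above can be illustrated with a minimal sketch. It assumes a standard two-component mixture in which annotator $k$ reports the true preference with probability $\eta_k$ and the flipped one otherwise; the function names, the closed-form reliability update without a prior, and the logistic (DPO-style) loss are illustrative assumptions, not necessarily the authors' exact formulation.

```python
import math
from collections import defaultdict

def e_step(p_model, annotators, eta):
    """Posterior probability that each observed label is correct.

    p_model[i]: model probability that the labeled winner beats the loser,
                assumed to lie strictly between 0 and 1.
    annotators[i]: id of the annotator who produced label i.
    eta[k]: current reliability estimate for annotator k (assumed mixture model).
    """
    weights = []
    for p, k in zip(p_model, annotators):
        correct = eta[k] * p                    # label correct and model agrees
        flipped = (1.0 - eta[k]) * (1.0 - p)    # label flipped and model disagrees
        weights.append(correct / (correct + flipped))
    return weights

def m_step_reliability(weights, annotators):
    """Update each annotator's reliability as the mean posterior of their labels."""
    total, count = defaultdict(float), defaultdict(int)
    for w, k in zip(weights, annotators):
        total[k] += w
        count[k] += 1
    return {k: total[k] / count[k] for k in total}

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def reweighted_loss(margins, weights):
    """Posterior-weighted logistic loss (an illustrative choice of base loss).

    margins[i] is the reward margin in favor of the labeled winner. Each pair
    contributes the loss for the observed label with weight w and the loss for
    the flipped label with weight 1 - w.
    """
    terms = [w * softplus(-m) + (1.0 - w) * softplus(m)
             for m, w in zip(margins, weights)]
    return sum(terms) / len(terms)

# Toy usage: three labels from two hypothetical annotators "a" and "b".
p_model = [0.9, 0.2, 0.7]
annotators = ["a", "a", "b"]
eta = {"a": 0.8, "b": 0.8}
w = e_step(p_model, annotators, eta)
new_eta = m_step_reliability(w, annotators)
```

In this toy run, the second label, which the model considers unlikely (0.2), receives a posterior weight of exactly 0.5 under a reliability of 0.8, so it contributes far less directional signal than the first label. In full training, the policy update that minimizes the reweighted loss would alternate with these two steps.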

Paper Structure

This paper contains 50 sections, 1 theorem, 33 equations, 3 figures, 8 tables, 1 algorithm.

Key Result

Theorem 4.1

Let $\theta^\star$ be a perfectly calibrated parameter such that the model distribution matches the ground-truth preference distribution. Assume that not all $p_i^\star=p(y_{w, i} \succ^* y_{l,i}|x_i)$ equal $\tfrac12$ for $i\in\mathcal{I}_k$. Consider the sequence of reliability estimates $\{\eta_k

Figures (3)

  • Figure 1: Overview of the Robust Enhanced Policy Optimization (RE-PO) framework. Starting from noisy pairwise feedback, RE-PO uses an Expectation–Maximization (EM) procedure to jointly refine label confidences and the policy. In each iteration, the E-step estimates a confidence score for every observed preference by inferring the posterior probability that the label is correct under the current model and annotator reliabilities. The M-step then uses these scores as adaptive weights to update both the LLM policy and the annotator reliability parameters, progressively down-weighting likely corrupted labels and emphasizing reliable supervision.
  • Figure 2: Empirical verification of annotator reliability estimation under controlled synthetic noise. Ground-truth reliability ($\eta_{\text{GPT-4o}}$) is established using GPT-4o's labels on UltraFeedback-derived preference pairs, and different reliability levels are simulated by injecting synthetic noise into copies of the dataset. In the single-annotator setting (a), a single annotator's dataset is perturbed with varying noise rates. In the two-annotator setting (b), Annotator 1 uses the original data with no added noise, while noise is progressively added to Annotator 2's data. The plots compare ground-truth reliabilities (solid lines) with RE-PO-estimated reliabilities (dashed lines), showing that RE-PO closely tracks the true reliability in both scenarios.
  • Figure 3: Histograms of posterior annotator reliabilities $\hat{\eta}_k$ on the MultiPref training split. Rows correspond to backbones (Mistral-7B-Instruct-v0.2, top; Llama-3-8B-Instruct, bottom). Columns correspond to different choices of the prior mean $\eta_0 \in \{0.80, 0.90, 0.95, 0.99\}$ (from left to right). Each panel reports the empirical mean $\mu$ and standard deviation $\sigma$ of $\{\hat{\eta}_k\}_{k=1}^{227}$.
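
The controlled-noise setup described for Figure 2 can be sketched as follows. This is a minimal illustration that assumes noise is injected by independently swapping the chosen and rejected responses of each pair with a fixed probability; the function name `inject_label_noise` and this particular noise model are hypothetical, since the captions state only that synthetic noise is added to copies of the dataset.

```python
import random

def inject_label_noise(pairs, noise_rate, seed=0):
    """Return a copy of `pairs` with some preference labels flipped.

    pairs: list of (chosen, rejected) tuples.
    noise_rate: probability of swapping each pair (assumed i.i.d. flips).
    seed: seed for a private random generator, so runs are reproducible.
    """
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < noise_rate:
            noisy.append((rejected, chosen))   # flipped label
        else:
            noisy.append((chosen, rejected))   # label kept as-is
    return noisy

# Toy usage: a clean copy for one annotator and a noisy copy for another,
# mirroring the two-annotator setting.
clean_pairs = [("a", "b")] * 5
annotator_1 = inject_label_noise(clean_pairs, noise_rate=0.0)
annotator_2 = inject_label_noise(clean_pairs, noise_rate=0.3)
```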

Theorems & Definitions (1)

  • Theorem 4.1: Identification and convergence of RE-PO