RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment
Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu
TL;DR
This work tackles the fragility of LLM alignment methods in the presence of noisy human preferences. It introduces Robust Enhanced Policy Optimization (RE-PO), an EM-based framework that treats the correctness of each preference label as a latent variable and jointly infers per-label confidences and per-annotator reliabilities to reweight supervision, yielding robust alignment across multiple loss formulations. The authors establish a theoretical consistency result under a perfectly calibrated model and demonstrate substantial empirical gains (up to 7.0 percentage points in AlpacaEval 2 win rate) across four direct preference objectives (DPO, IPO, SimPO, CPO) and two base models, as well as on multi-annotator data (MultiPref). They also present RE-PO as a general-purpose robustness layer, with qualitative evidence that it down-weights noisy labels, offering practical benefits for real-world preference data.
Abstract
Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone for aligning large language models (LLMs) with human values. However, these methods typically assume that preference data is clean and that all labels are equally reliable. In practice, large-scale preference datasets contain substantial noise due to annotator mistakes, inconsistent instructions, varying expertise, and even adversarial or low-effort feedback. This mismatch between recorded labels and ground-truth preferences can misguide training and degrade model performance. To address this issue, we introduce Robust Enhanced Policy Optimization (RE-PO), which uses an expectation-maximization procedure to infer the posterior correctness of each label and then adaptively reweight data points in the training loss to mitigate label noise. We further generalize this idea by establishing a theoretical link between arbitrary preference losses and their underlying probabilistic models, enabling a systematic transformation of existing alignment algorithms into robust counterparts and elevating RE-PO from a single method to a general framework for robust preference alignment. Theoretically, we prove that, under a perfectly calibrated model, RE-PO recovers the true noise level of the dataset. Empirically, we show that RE-PO consistently improves four state-of-the-art alignment methods (DPO, IPO, SimPO, and CPO); when applied to Mistral and Llama 3 models, the RE-PO-enhanced variants increase AlpacaEval 2 win rates by up to 7.0 percentage points over their respective baselines.
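The exact RE-PO update rules are not stated in this summary, so the following is only a minimal sketch of the general idea: a generic EM round for a symmetric label-flip noise model. It assumes that `p_model` holds the current policy's probability that the recorded winner really is preferred (for example, the sigmoid of a DPO-style logit), and that `eta[a]` is the probability that annotator `a` labels correctly. All function and variable names are hypothetical and do not come from the paper.

```python
import math


def posterior_correct(p_model, eta):
    """E-step (assumed form): posterior probability that a recorded label is correct,
    given the model's preference probability and the annotator's reliability."""
    num = eta * p_model
    return num / (num + (1.0 - eta) * (1.0 - p_model))


def update_reliabilities(weights, annotator_ids, eta):
    """M-step (assumed form): each annotator's reliability becomes the mean
    posterior correctness of their labels; annotators with no labels keep eta."""
    sums = [0.0] * len(eta)
    counts = [0] * len(eta)
    for w, a in zip(weights, annotator_ids):
        sums[a] += w
        counts[a] += 1
    return [s / c if c else old for s, c, old in zip(sums, counts, eta)]


def reweighted_loss(p_model, weights):
    """Confidence-weighted negative log-likelihood: weight w on the recorded
    label and 1 - w on the flipped label."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, w in zip(p_model, weights):
        total -= w * math.log(p + eps) + (1.0 - w) * math.log(1.0 - p + eps)
    return total / len(p_model)


def em_round(p_model, annotator_ids, eta):
    """One E-step followed by one M-step; returns (weights, new_eta)."""
    weights = [posterior_correct(p, eta[a]) for p, a in zip(p_model, annotator_ids)]
    return weights, update_reliabilities(weights, annotator_ids, eta)


# Toy usage: annotator 0 gave two labels, annotator 1 gave one.
weights, new_eta = em_round([0.9, 0.1, 0.5], [0, 0, 1], [0.9, 0.9])
loss = reweighted_loss([0.9, 0.1, 0.5], weights)
```

In this toy run, the label that the model strongly disagrees with (`p_model = 0.1`) receives a posterior weight of 0.5 even though the annotator is assumed to be 90% reliable, so it contributes less one-sided gradient signal than a label the model agrees with. In a real training loop the weights would be recomputed as the policy changes, and the weighted loss would be expressed in terms of the chosen objective (DPO, IPO, SimPO, or CPO) rather than the plain log-likelihood used here.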
