Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Masahiro Fujisawa, Masaki Adachi, Michael A. Osborne

TL;DR

The paper tackles the problem of aligning language models to human preferences when crowd-sourced feedback is noisy. It introduces Hölder-DPO, a robust Direct Preference Optimization loss using Hölder divergence to achieve provable redescending robustness, enabling estimation of the clean data distribution and detection of mislabeled data without a clean validation set. It provides a principled method to estimate the contamination ratio $\epsilon$ and identifies mislabels, demonstrated on controlled tasks and real-world datasets (e.g., Anthropic HH), with scalability to larger models via LoRA and quantization. Empirically, Hölder-DPO improves alignment quality, detects substantial annotation noise, and can be used as a pre-filtering step for dataset cleaning, offering a scalable path to robust, automated feedback valuation and model alignment.

Abstract

Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy -- for example, preferring less desirable responses -- posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment method satisfies this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated human feedback valuation without costly manual verification or a clean validation dataset. Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, applied to the Anthropic HH-RLHF dataset, it reveals substantial noise levels, and removing these mislabels significantly improves alignment performance across methods. The code is available at https://github.com/ma921/HolderDPO.
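
To make the noise setting concrete, the following is a minimal sketch, not the authors' implementation: it simulates label flips at rate $\epsilon$ on preference pairs and evaluates the standard (non-robust) DPO loss on a single pair. The helper names `contaminate`, `dpo_loss`, and `softplus`, and the default `beta=0.1`, are assumptions made for illustration. Hölder-DPO itself is not reproduced here.

```python
import math
import random


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Arguments are sequence log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


def contaminate(pairs, epsilon, seed=0):
    """Swap chosen/rejected in each pair independently with probability epsilon.

    Returns the noisy pairs and a list of booleans marking which pairs
    were flipped (the ground truth a mislabel detector would try to recover).
    """
    rng = random.Random(seed)
    noisy, flipped = [], []
    for chosen, rejected in pairs:
        flip = rng.random() < epsilon
        noisy.append((rejected, chosen) if flip else (chosen, rejected))
        flipped.append(flip)
    return noisy, flipped
```

Because a flip negates the margin, a flipped pair on which the model is confidently correct incurs a loss that grows roughly linearly with the margin. This is one way to see why the standard loss is sensitive to mislabels.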

Paper Structure

This paper contains 53 sections, 24 theorems, 117 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Theorem 2

Let $\theta^*$ be the optimum learnt from the clean dataset $p_\mathcal{D}$, and $\theta^*(\epsilon)$ the optimum learnt from the $\epsilon$-contaminated dataset $p^{(\epsilon)}_{\widetilde{\mathcal{D}}}$. Let $\mathcal{L}_{\mathrm{gen}}(\pi_{\theta}; \pi_{\mathrm{ref}})$ be a generic DPO loss function … where $\nabla_{\theta} \mathcal{L}_{\mathrm{gen}}(s_{\mathrm{flip}}, \pi_{\theta^{*}})$ corresponds …
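
For orientation, the notation above can be read against the usual Huber-style definitions from robust statistics. The block below is a sketch under that assumption, not a quotation of the paper's Definition 1 or Theorem 2: the point mass $\delta_{s_{\mathrm{flip}}}$ and the symbol $\mathrm{IF}$ are introduced here for illustration, and the paper's exact statements may differ in detail.

```latex
% Assumed Huber-style contamination: a fraction \epsilon of the data
% is replaced by a flipped preference pair s_flip.
p^{(\epsilon)}_{\widetilde{\mathcal{D}}}
  = (1-\epsilon)\, p_{\mathcal{D}} + \epsilon\, \delta_{s_{\mathrm{flip}}}

% Influence function: first-order sensitivity of the optimum to contamination.
\mathrm{IF}(s_{\mathrm{flip}}; \theta^{*}, p_{\mathcal{D}})
  = \left. \frac{\partial\, \theta^{*}(\epsilon)}{\partial \epsilon} \right|_{\epsilon = 0}
```

Under this reading, a redescending loss is one whose influence function vanishes as the contaminating point becomes extreme, so that $\theta^*(\epsilon)$ stays close to $\theta^*$.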

Figures (7)

  • Figure 1: While existing DPO variants are vulnerable to $\epsilon$-contamination, Hölder-DPO is provably robust. It also ranks data points by clean-data likelihood, enabling mislabel identification.
  • Figure 2: IF analysis reveals only Hölder-DPO satisfies the redescending property.
  • Figure 3: Controlled sentiment generation task using GPT2-large. Error bars indicate the standard deviation over 10 trials with different random seeds. Hölder-DPO consistently outperforms all baselines in average reward under varying (a) contamination ratios $\epsilon$, (b) generation temperatures, and (c) training steps. In addition, only Hölder-DPO offers reliable (d) contamination ratio estimation $\hat{\epsilon}$ and (e) mislabel identification, measured by precision as a binary classifier.
  • Figure 4: Helpful assistant dialogue generation on the Golden HH dataset. Error bars denote standard deviation over 10 trials with different random seeds. Hölder-DPO consistently achieves the highest GPT-4 win rate across both base models: (a) Qwen-2.5-1.5B and (b) Phi-2. Notably, Hölder-DPO is the only method that outperforms GPT-4-generated prompts even when using the smaller 2.8B Phi-2 model. It also delivers near-perfect (d) contamination ratio estimation $\hat{\epsilon}$ and (e) precision in mislabelled data detection, regardless of the base model.
  • Figure 5: Dataset valuation using Phi-2. (a) Hölder-DPO estimates substantial contamination ($\hat{\epsilon} \approx 0.25$) in the popular Anthropic HH dataset. (b) Distribution of log-likelihoods of clean data points in the dataset. (c) Improvement in GPT-4 win rate across methods after removing detected noisy data points from the training set, surpassing even Hölder-DPO trained on the original (noisy) dataset.
  • ...and 2 more figures

Theorems & Definitions (46)

  • Definition 1: $\epsilon$-contamination model
  • Definition 2: Redescending property [MaronnaRobust2019]
  • Theorem 2: IF for DPO variants
  • Theorem 3: informal
  • Definition 3: Hölder divergence [KANAMORI14]
  • Remark 1
  • Theorem 4: IF for Hölder-DPO
  • Corollary 1: Hölder-DPO is robust
  • Proposition 1: Contamination ratio estimator
  • Theorem 5: IF for DPO
  • ...and 36 more