Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Masahiro Fujisawa; Masaki Adachi; Michael A. Osborne

TL;DR

The paper tackles the problem of aligning language models to human preferences when crowd-sourced feedback is noisy. It introduces Hölder-DPO, a robust Direct Preference Optimization loss based on the Hölder divergence that is provably redescending, which enables estimation of the clean data distribution and detection of mislabeled data without a clean validation set. It also gives a principled method for estimating the contamination ratio $\epsilon$ and identifying mislabels, demonstrated on controlled tasks and real-world datasets (e.g., Anthropic HH), and it scales to larger models via LoRA and quantization. Empirically, Hölder-DPO improves alignment quality, detects substantial annotation noise, and can serve as a pre-filtering step for dataset cleaning, offering a scalable path to robust, automated feedback valuation and model alignment.

Abstract

Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy (for example, annotators may mark the less desirable response as preferred), posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment method satisfies this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated human feedback valuation without costly manual verification or a clean validation dataset. Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, applied to the Anthropic HH-RLHF dataset, it reveals substantial noise levels, and removing these mislabels significantly improves alignment performance across methods. The code is available at https://github.com/ma921/HolderDPO.
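
To make the idea of likelihood-based, gradient-free dataset valuation concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard DPO Bradley-Terry form for the probability that the labeled "chosen" response is preferred, and it uses a simple "flag the lowest $\epsilon$ fraction" rule as a stand-in for the paper's detection criterion, which is not specified in this abstract. The function names, the default `beta`, and the toy numbers are all hypothetical.

```python
import math


def preference_likelihood(logp_chosen, logp_rejected,
                          ref_chosen, ref_rejected, beta=0.1):
    """Probability that the labeled 'chosen' response is preferred.

    Uses the standard DPO implicit-reward margin (an assumption here,
    not necessarily the paper's exact metric). Inputs are sequence
    log-probabilities under the aligned model and the reference model,
    so only forward passes are needed (no gradients).
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return 1.0 / (1.0 + math.exp(-beta * margin))


def flag_suspected_mislabels(likelihoods, epsilon):
    """Return indices of the int(epsilon * n) lowest-likelihood pairs.

    `epsilon` plays the role of an estimated contamination ratio; the
    bottom-fraction rule is an illustrative assumption.
    """
    n_flag = int(epsilon * len(likelihoods))
    order = sorted(range(len(likelihoods)), key=lambda i: likelihoods[i])
    return sorted(order[:n_flag])


# Toy usage with made-up likelihoods for five preference pairs.
toy_likelihoods = [0.9, 0.2, 0.8, 0.1, 0.7]
suspects = flag_suspected_mislabels(toy_likelihoods, epsilon=0.4)
cleaned = [i for i in range(len(toy_likelihoods)) if i not in suspects]
```

In this toy run the two lowest-likelihood pairs are flagged, and the remaining indices form the filtered dataset that could then be passed to any alignment method, mirroring the pre-filtering use described above.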
