Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Samuel Yeh; Sharon Li

TL;DR

This work addresses the problem of noisy human feedback in aligning large language models by introducing PrefCleanBench, the first comprehensive benchmark for 13 preference data cleaning methods. It standardizes a protocol spanning four public preference datasets, multiple backbones, and diverse optimization algorithms to evaluate both alignment performance and generalizability. Key findings show that identifying unreliable data using multiple judges and removing it generally yields better alignment than simply flipping labels, and that data cleaning methods vary in effectiveness depending on the optimization algorithm and base model. The study emphasizes data quality as a central factor in responsible AI development and provides open-source tools to foster reproducible, data-centric alignment research.
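The contrast between removing and flipping unreliable pairs can be made concrete with a minimal sketch. It is not the PrefCleanBench implementation: the `Pair` layout, the `clean_preferences` function, the boolean per-judge votes, and the 0.5 agreement threshold are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (prompt, chosen_response, rejected_response).
Pair = Tuple[str, str, str]


def judge_agreement(votes: List[bool]) -> float:
    """Fraction of judges whose verdict matches the original label."""
    return sum(votes) / len(votes)


def clean_preferences(
    pairs: List[Pair],
    votes: List[List[bool]],
    threshold: float = 0.5,   # assumed cutoff, not taken from the paper
    strategy: str = "remove",
) -> List[Pair]:
    """Keep pairs the judges mostly agree with; remove or flip the rest."""
    if strategy not in ("remove", "flip"):
        raise ValueError(f"unknown strategy: {strategy}")
    cleaned: List[Pair] = []
    for pair, pair_votes in zip(pairs, votes):
        if judge_agreement(pair_votes) >= threshold:
            cleaned.append(pair)  # label judged reliable, keep as-is
        elif strategy == "flip":
            prompt, chosen, rejected = pair
            cleaned.append((prompt, rejected, chosen))  # swap the label
        # strategy == "remove": the unreliable pair is simply dropped
    return cleaned


pairs = [("q1", "good", "bad"), ("q2", "worse", "better")]
votes = [[True, True, False], [False, False, True]]
removed = clean_preferences(pairs, votes, strategy="remove")
flipped = clean_preferences(pairs, votes, strategy="flip")
```

In this toy input the second pair has only one of three judges agreeing with its label, so `remove` drops it while `flip` keeps it with the chosen and rejected responses swapped.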

Abstract

Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce PrefCleanBench, the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
