Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Samuel Yeh, Sharon Li

TL;DR

This work addresses the problem of noisy human feedback in aligning large language models by introducing PrefCleanBench, the first comprehensive benchmark for 13 preference data cleaning methods. It standardizes a protocol spanning four public preference datasets, multiple backbones, and diverse optimization algorithms to evaluate both alignment performance and generalizability. Key findings show that identifying unreliable data using multiple judges and removing it generally yields better alignment than simply flipping labels, and that data cleaning methods vary in effectiveness depending on the optimization algorithm and base model. The study emphasizes data quality as a central factor in responsible AI development and provides open-source tools to foster reproducible, data-centric alignment research.
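To make the removal-versus-flipping distinction concrete, here is a minimal Python sketch. It is not the PrefCleanBench implementation: the `PreferencePair` type, the function names, the boolean vote encoding, and the 0.5 threshold are all assumptions made for illustration. Each pair carries a list of judge votes, where `True` means the judge agrees with the original label.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one preference example."""
    prompt: str
    chosen: str
    rejected: str


def disagreement_rate(votes: List[bool]) -> float:
    """Fraction of judges that disagree with the original label."""
    return sum(1 for v in votes if not v) / len(votes)


def clean_by_removal(pairs, judge_votes, threshold=0.5):
    """Drop pairs whose judge disagreement exceeds the (assumed) threshold."""
    return [
        pair
        for pair, votes in zip(pairs, judge_votes)
        if disagreement_rate(votes) <= threshold
    ]


def clean_by_flipping(pairs, judge_votes, threshold=0.5):
    """Keep every pair, but swap chosen/rejected where judges disagree."""
    cleaned = []
    for pair, votes in zip(pairs, judge_votes):
        if disagreement_rate(votes) > threshold:
            # Swap the two responses instead of discarding the example.
            pair = replace(pair, chosen=pair.rejected, rejected=pair.chosen)
        cleaned.append(pair)
    return cleaned
```

Removal shrinks the dataset but never introduces a label the annotators did not give; flipping keeps the dataset size but trusts the judges enough to overwrite the original label. The sketch only shows the mechanics of the two strategies, not which one performs better.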

Abstract

Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce PrefCleanBench, the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality, highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.

Paper Structure

This paper contains 42 sections, 1 equation, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Overview of the protocol for benchmarking data cleaning approaches. The proposed protocol covers dataset selection, evaluation pipelines, and the evaluation criteria with their corresponding metrics.
  • Figure 2: Summary of data cleaning approaches for LLM alignment. The approaches are grouped into three categories according to the definition of unreliability they adopt: LLM-as-a-judge, reward model score, and heuristic criteria. The figure marks the unreliable data identified by each approach.
  • Figure 3: Training hyperparameters for SFT and PEFT models.
  • Figure 4: Configurations of generating responses.

Theorems & Definitions (2)

  • Definition 3.1: Human preference data.
  • Definition 3.2: Preference data cleaning.