CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models

Son The Nguyen, Niranjan Uma Naresh, Theja Tulabandhula

TL;DR

This work tackles the problem of aligning large language models with human values when the preference data is incomplete and adversarially corrupted. It introduces robust, polynomial-time ranking algorithms (RORATRON and CURATRON) that leverage a logit low-rank representation and robust PCA (RPCA) to recover an $\epsilon$-optimal ranking with high probability, even when up to $O(n)$ pairwise comparisons per model response are perturbed and the data is only partially observed. The methods generalize across ranking models (BTL, LR, and Thurstonian) and are complemented by a data-augmentation strategy that further improves recovery and downstream LLM alignment. Experimental results demonstrate strong robustness to adversarial noise and missing data, and show meaningful improvements in DPO-style fine-tuning when the recovered preference data is used, highlighting practical impact for scalable, ethically aligned AI systems.
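
To illustrate why a low-rank-plus-sparse decomposition such as RPCA suits this setting, the following minimal sketch builds a BTL preference matrix, applies the logit, and checks its rank before and after a few comparisons are flipped. It is not the paper's RORATRON or CURATRON algorithm. The convention `P[i, j]` = probability that item `i` beats item `j`, the helper names, and the example scores are assumptions made for this illustration.

```python
import numpy as np

def btl_preference_matrix(scores):
    """Assumed convention: P[i, j] = Pr(item i is preferred over item j) under BTL."""
    w = np.exp(scores)
    return w[:, None] / (w[:, None] + w[None, :])

def logit(p):
    """Elementwise logit, log(p / (1 - p))."""
    return np.log(p) - np.log1p(-p)

n = 8
scores = np.linspace(-1.0, 1.0, n)   # hypothetical latent quality scores
P = btl_preference_matrix(scores)

# Under BTL, logit(P)[i, j] = scores[i] - scores[j], a matrix of rank at most 2.
L_clean = logit(P)
rank_clean = np.linalg.matrix_rank(L_clean, tol=1e-8)

# Simulate adversarial corruption by flipping the outcome of a few pairs.
P_bad = P.copy()
for i, j in [(0, 3), (2, 5), (4, 7)]:
    P_bad[i, j], P_bad[j, i] = P[j, i], P[i, j]

# The corrupted logit matrix equals the rank-2 matrix plus a sparse perturbation,
# so its rank increases.
rank_corrupt = np.linalg.matrix_rank(logit(P_bad), tol=1e-8)

print("rank of clean logit matrix:    ", rank_clean)
print("rank of corrupted logit matrix:", rank_corrupt)
```

The corrupted matrix is the sum of a rank-2 matrix and a sparse matrix supported on the flipped pairs, which is the structure that robust PCA is designed to separate.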

Abstract

This paper addresses the challenges of aligning large language models (LLMs) with human values via preference learning (PL), focusing on incomplete and corrupted data in preference datasets. We propose a novel method for robustly and completely recalibrating values within these datasets to enhance LLMs' resilience against these issues. In particular, we devise a guaranteed polynomial-time ranking algorithm that robustifies several existing models, such as the classic Bradley-Terry-Luce (BTL) (Bradley and Terry, 1952) model and certain generalizations of it. To the best of our knowledge, our present work is the first to propose an algorithm that provably recovers an $\epsilon$-optimal ranking with high probability while allowing as many as $O(n)$ perturbed pairwise comparison results per model response. Furthermore, we show robust recovery results in the partially observed setting. Our experiments confirm that our algorithms handle adversarial noise and unobserved comparisons well in both general and LLM preference dataset settings. This work contributes to the development and scaling of more reliable and ethically aligned AI models by equipping the dataset curation pipeline with the ability to handle missing and maliciously manipulated inputs.

Paper Structure

This paper contains 45 sections, 6 theorems, 15 equations, 7 figures, 3 tables, 5 algorithms.

Key Result

Lemma 1

Let $a,b,c \in (0,1)$ such that $c=a+b$. Then, we have,

Figures (7)

  • Figure 1: CURATRON corrects incomplete and adversarially corrupted preference data to improve RLHF/DPO alignment results compared to raw initial preference data.
  • Figure 2: Three algorithms to address scenarios where incomplete data and adversarial corruption can impact LLMs.
  • Figure 3: Robust recovery results of the BTL model: we fix $\nu=2$ and vary $d$ in the left plot; we fix $d=100$ and vary $\nu$ in the right plot. The black line represents RORATRON, the blue line represents Maximum Likelihood (ML), the pink line represents Rank Centrality (RC), and the red line represents Borda Count (BC).
  • Figure 4: Left: Original matrix. Middle: Corrupted matrix. Right: Reconstructed matrix. The corrupted matrix has 10% adversarial corruptions and 10% of unobserved comparisons. We use our CURATRON algorithm to recover the original matrix.
  • Figure 5: Left column: NFE between reconstructed and original matrices. Middle column: Correlation between reconstructed and original matrices. Right column: Distance between reconstructed and original rankings. Top row: Unobserved and adversarial corruptions. Middle row: Recovering without augmentation. Bottom row: Recovering with augmentation. Average over 5 runs for different percentages of unobserved and adversarial comparisons.
  • ...and 2 more figures

Theorems & Definitions (17)

  • Claim 1
  • Lemma 1: Some properties of the logit function
  • Theorem 1
  • Remark 1: Computational complexity
  • Remark 2: Identifying adversarially corrupted pairwise comparisons
  • Remark 3: Missing data versus adversarially corrupted data
  • Lemma 2: Incoherence of BTL and LR models
  • Corollary 1: Recovery result for BTL model
  • Theorem 2
  • Remark 4: Robust estimation of BTL model in the partially observed case
  • ...and 7 more