Clone-Robust AI Alignment

Ariel D. Procaccia, Benjamin Schiffer, Shirley Zhang

TL;DR

This work addresses RLHF alignment under diverse human preferences and unbalanced datasets by introducing robustness to approximate clones. It shows that the standard regularized MLE is not robust to near-duplicate alternatives and proposes Weighted MLE, which uses Voronoi-based weights to down-weight redundant options and achieve clone-robustness while preserving interpretability. The authors prove robustness under a Lipschitz reward assumption, relate Weighted MLE to weighted average win rates and S-space MLE, and illustrate the benefits with a case study using GPT-4o-mini to simulate annotators describing Paris. The results advance clone-robust AI alignment in RLHF and offer practical guidance for constructing more stable reward models in the presence of data duplication and diversity across annotators.

Abstract

A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF are not necessarily balanced in the types of questions and answers that are included. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.
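The idea of weighting alternatives inside a regularized MLE can be sketched on a toy problem. The code below is a minimal illustration, not the paper's algorithm: it assumes a Bradley-Terry likelihood over a finite set of alternatives, an L2 penalty with strength `lam`, and a per-comparison weight equal to the product of the two alternatives' weights. The function name `fit_weighted_mle`, the hand-picked weights, and the preference probabilities are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weighted_mle(num_items, comparisons, weights, lam=0.1, lr=0.5, steps=2000):
    """Gradient descent on a weighted, L2-regularized Bradley-Terry loss.

    comparisons: list of (i, j, p) where p is the fraction of annotators
                 preferring alternative i over alternative j.
    weights:     one non-negative weight per alternative (assumed given).
    Setting every weight to 1 recovers the unweighted regularized MLE.
    """
    r = [0.0] * num_items
    for _ in range(steps):
        grad = [lam * rk for rk in r]  # gradient of (lam / 2) * sum(r_k ** 2)
        for i, j, p in comparisons:
            # Gradient of the weighted negative log-likelihood term for (i, j).
            g = weights[i] * weights[j] * (p - sigmoid(r[i] - r[j]))
            grad[i] -= g
            grad[j] += g
        r = [rk - lr * gk for rk, gk in zip(r, grad)]
    return r


# Two alternatives: 80% of annotators prefer item 0 over item 1.
base = fit_weighted_mle(2, [(0, 1, 0.8)], [1.0, 1.0])

# Add item 2 as an exact clone of item 1 (hypothetical data).
cloned_data = [(0, 1, 0.8), (0, 2, 0.8), (1, 2, 0.5)]
unweighted = fit_weighted_mle(3, cloned_data, [1.0, 1.0, 1.0])
# Hand-picked weights that split item 1's weight between the two clones.
weighted = fit_weighted_mle(3, cloned_data, [1.0, 0.5, 0.5])

print("item 0 reward, no clone:        ", round(base[0], 3))
print("item 0 reward, clone, unweighted:", round(unweighted[0], 3))
print("item 0 reward, clone, weighted:  ", round(weighted[0], 3))
```

In this toy setup, the clone shifts the reward of item 0 under the unweighted fit, and down-weighting the clones reduces that shift. It does not reproduce the paper's guarantee, which is stated for its own weighting scheme and assumptions.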

Paper Structure

This paper contains 26 sections, 12 theorems, 73 equations, 5 figures, 4 tables.

Key Result

Theorem 2.2

Let $n = 2$ and suppose $\mathcal{D}$ is a representative preference dataset over alternatives in $\mathcal{M}$. Then for any algorithm $\mathrm{ALG}$ and any $C > 0$, there exist $r_1^*$ and $r_2^*$ such that $r^{\mathcal{D}} := \mathrm{ALG}(\mathcal{D})$ satisfies

Figures (5)

  • Figure 1: Voronoi diagram for $\mathcal{M} = \{(0,0), (1,0), (1,1)\}$.
  • Figure 2: Diagram for $\mathcal{M} = \{(0,0), (1,0), (1,1), (0.9,1)\}$.
  • Figure 3: Results for the MLE: The yellow points show the average value of the MLE reward function for different topics when trained on dataset 'Original'. The blue points show the same but when trained on dataset 'Clones'. In the presence of clones, the rewards for both art and romance change significantly, showing that the MLE is not robust to clones.
  • Figure 4: Results for the weighted MLE: The yellow points show the average value of the weighted MLE reward function for different topics when trained on dataset 'Original'. The blue points show the same but when trained on dataset 'Clones'. The rewards for the three topics do not change significantly, demonstrating the robustness of the weighted MLE.
  • Figure 5: Results for the weighted MLE when $\mathcal{S}$ is chosen as any vector such that every coordinate is within a factor of $2$ of one of the observed coordinates. The yellow points show the average value of the weighted MLE reward function for different topics when trained on dataset 'Original'. The blue points show the same but when trained on dataset 'Clones'. In both cases, the reward function has the highest value for romance, demonstrating the robustness of the weighted MLE.
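The point sets from Figures 1 and 2 can be used to illustrate how Voronoi-based weights react to a near-duplicate. The sketch below is an assumption-laden illustration rather than the paper's construction: it takes the region of interest to be the unit square $[0,1]^2$, approximates each point's Voronoi cell by assigning grid midpoints to their nearest point, and uses the cell's share of the square as the weight. The function name `voronoi_weights` and the grid resolution are hypothetical.

```python
def voronoi_weights(points, n=200):
    """Approximate each point's Voronoi-cell share of the unit square.

    A regular n x n grid of cell midpoints stands in for the square;
    each midpoint is assigned to its nearest point (ties go to the
    lower index), and the weight is the fraction of midpoints assigned.
    """
    counts = [0] * len(points)
    for i in range(n):
        x = (i + 0.5) / n
        for j in range(n):
            y = (j + 0.5) / n
            nearest = min(
                range(len(points)),
                key=lambda k: (points[k][0] - x) ** 2 + (points[k][1] - y) ** 2,
            )
            counts[nearest] += 1
    total = n * n
    return [c / total for c in counts]


original = [(0, 0), (1, 0), (1, 1)]   # point set from Figure 1
with_clone = original + [(0.9, 1)]    # point set from Figure 2

w_orig = voronoi_weights(original)
w_clone = voronoi_weights(with_clone)

for label, pts, ws in [("original", original, w_orig), ("with clone", with_clone, w_clone)]:
    print(label)
    for p, w in zip(pts, ws):
        print("  ", p, round(w, 3))
```

Under these assumptions, the near-duplicate $(0.9, 1)$ mostly takes its area from the cell of $(1, 1)$, so the two clones together carry roughly the weight that $(1, 1)$ carried alone, while the weights of $(0, 0)$ and $(1, 0)$ change only slightly.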

Theorems & Definitions (28)

  • Example 2.1
  • Theorem 2.2
  • Definition 2.2
  • Theorem 2.3
  • Definition 3.1: Robust to Approximate Clones
  • Theorem 3.1
  • Definition 4.1
  • Theorem 4.1
  • Definition 4.2
  • Theorem 4.2
  • ...and 18 more