Strong Preferences Affect the Robustness of Preference Models and Value Alignment

Ziwei Xu, Mohan Kankanhalli

TL;DR

This paper analyzes the robustness of the preference models used in value alignment through a theoretical sensitivity study of common frameworks, notably the Bradley-Terry and Plackett-Luce models, and extends the analysis to the $K$-tuple Plackett-Luce model. It shows that the probability of a given preference can be highly sensitive to changes in other preferences, especially when those other preferences are near dominance. The authors derive explicit sensitivity regions and their areas, demonstrate that longer-tuple models are generally more robust (smaller sensitivity regions) than pairwise models, and validate these ideas with experiments on LLMs (e.g., Llama-3-8B-Instruct) and other real-world reward models. The work highlights a practical trade-off between modeling dominant preferences and robustness, and suggests that adopting longer tuples can improve robustness at the cost of additional data collection, with broad implications for value alignment and AI safety.

Abstract

Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical for ensuring safety and trustworthiness of these systems. A key component of value alignment is the modeling of human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in the preferences they model. Our findings reveal that, in the Bradley-Terry and the Plackett-Luce models, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., with probabilities near 0 or 1). We identify specific conditions where this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems.

Paper Structure

This paper contains 34 sections, 9 theorems, 28 equations, 5 figures, and 3 tables.

Key Result

lemma 1

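For all $o_i, o_j, o_k \in \Set{O}$, and under the pairwise model $p^{\text{\tiny(2)}}_{ij} = g(s_i-s_j)$ following the assumptions above, $p^{\text{\tiny(2)}}_{ij} = g\left(g^{-1}(p^{\text{\tiny(2)}}_{ik}) + g^{-1}(p^{\text{\tiny(2)}}_{kj})\right)$, where $g^{-1}: (0,1) \rightarrow \mathbb{R}$ is the inverse of $g$, mapping a probability to a difference of scores. The identity follows from $s_i - s_j = (s_i - s_k) + (s_k - s_j)$.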
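A minimal numerical sketch may help make the pairwise relation concrete. It assumes the Bradley-Terry case, where $g$ is the logistic function and $g^{-1}$ is the logit, so that $s_i - s_j = (s_i - s_k) + (s_k - s_j)$ gives $p_{ij}$ directly from $p_{ik}$ and $p_{kj}$. The function names (`bt_compose`, `sensitivity_to_p_kj`) are hypothetical and chosen only for illustration; the sensitivity is estimated with a finite difference rather than taken from the paper's closed-form results.

```python
import math


def logit(p: float) -> float:
    """g^{-1} for Bradley-Terry: maps a probability to a score difference."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """g for Bradley-Terry: maps a score difference to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bt_compose(p_ik: float, p_kj: float) -> float:
    """p_ij implied by p_ik and p_kj, since score differences add."""
    return sigmoid(logit(p_ik) + logit(p_kj))


def sensitivity_to_p_kj(p_ik: float, p_kj: float, eps: float = 1e-6) -> float:
    """Central-difference estimate of d p_ij / d p_kj (illustrative only)."""
    upper = bt_compose(p_ik, p_kj + eps)
    lower = bt_compose(p_ik, p_kj - eps)
    return (upper - lower) / (2.0 * eps)


if __name__ == "__main__":
    # Non-dominant preferences versus near-dominant preferences.
    for p_ik, p_kj in [(0.5, 0.5), (0.99, 0.01)]:
        p_ij = bt_compose(p_ik, p_kj)
        slope = sensitivity_to_p_kj(p_ik, p_kj)
        print(f"p_ik={p_ik}, p_kj={p_kj}: p_ij={p_ij:.4f}, d p_ij/d p_kj={slope:.2f}")
```

In both cases the implied $p_{ij}$ is 0.5, but the estimated slope with respect to $p_{kj}$ is far larger when $p_{ik}$ and $p_{kj}$ are near 1 and 0, which is the kind of sensitivity to dominant preferences the paper describes.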

Figures (5)

  • Figure 1: $M$-sensitive regions of $p^{\text{\tiny BT}}_{ij}$ w.r.t. $p^{\text{\tiny BT}}_{ik}$ and $p^{\text{\tiny BT}}_{kj}$, for $M \in \{1.01,2,3,5,10\}$.
  • Figure 2: $M$-sensitive region of $p^{{(K)}}_{\Vec{\omega}}$ w.r.t. $p^{{(K)}}_{\Vec{\omega}_{uv}}$ and $p^{{(K)}}_{\Vec{\omega}_{vu}}$, with $\alpha=1.01, \beta=0.99$.
  • Figure 3: Preferences of Llama-3-8B-Instruct after being trained on constructed datasets with dominant preferences. Each data point in the figure represents one model trained on a particular dataset $\Set{D}(\Vec{\omega}_a, p^{\text{\tiny D}}_{12}, p^{\text{\tiny D}}_{23})$. $p^{\text{\tiny L}}_*$ are preference probabilities learned by the model. Shaded areas represent one standard deviation from the mean of three runs with different random seeds. $\triangle$ and $\square$ markers indicate probabilities that are specified and unspecified by the dataset, respectively.
  • Figure C4: Preferences of zephyr-7b-alpha after being trained on constructed datasets with dominant preferences. Each data point in the figure represents one model trained on a particular dataset $\Set{D}(\Vec{\omega}_a, p^{\text{\tiny D}}_{12}, p^{\text{\tiny D}}_{23})$. $p^{\text{\tiny L}}_*$ are preference probabilities learned by the model. Shaded areas represent one standard deviation from the mean of three runs with different random seeds. $\triangle$ and $\square$ markers indicate probabilities that are specified and unspecified by the dataset, respectively.
  • Figure C5: Preferences of Llama-3-8B-Instruct after being trained on constructed datasets with non-dominant $p^{\text{\tiny D}}_{12}=0.5$. Each data point in the figure represents one model trained on a particular dataset $\Set{D}(\Vec{\omega}_a, p^{\text{\tiny D}}_{12}, p^{\text{\tiny D}}_{23})$. $p^{\text{\tiny L}}_*$ are preference probabilities learned by the model. Shaded areas represent one standard deviation from the mean of three runs with different random seeds. $\triangle$ and $\square$ markers indicate probabilities that are specified and unspecified by the dataset, respectively.
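
The regions in Figure 1 can be approximated numerically with a rough sketch like the one below. It rests on two assumptions that this listing does not spell out: that $g$ is the logistic function (Bradley-Terry), and that "$M$-sensitive" means the magnitude of the partial derivative of $p_{ij}$ with respect to $p_{kj}$ exceeds $M$. Under the first assumption the derivative has the closed form $p_{ij}(1-p_{ij}) / \big(p_{kj}(1-p_{kj})\big)$. The function names and the grid resolution are hypothetical, and the grid count is only an estimate of the region's area, not the paper's analytic result.

```python
def bt_sensitivity(p_ik: float, p_kj: float) -> float:
    """|d p_ij / d p_kj| under Bradley-Terry, in closed form."""
    a = p_ik * p_kj
    b = (1.0 - p_ik) * (1.0 - p_kj)
    p_ij = a / (a + b)  # equals sigmoid(logit(p_ik) + logit(p_kj))
    return p_ij * (1.0 - p_ij) / (p_kj * (1.0 - p_kj))


def sensitive_area(M: float, n: int = 200) -> float:
    """Fraction of the unit square (p_ik, p_kj) where the sensitivity exceeds M.

    Uses an n-by-n midpoint grid, so 0 and 1 are never evaluated.
    Assumption: M-sensitivity is read as |d p_ij / d p_kj| > M.
    """
    count = 0
    for r in range(n):
        p_ik = (r + 0.5) / n
        for c in range(n):
            p_kj = (c + 0.5) / n
            if bt_sensitivity(p_ik, p_kj) > M:
                count += 1
    return count / (n * n)


if __name__ == "__main__":
    # The thresholds are the values of M listed in the Figure 1 caption.
    for M in (1.01, 2, 3, 5, 10):
        print(f"M={M}: estimated area {sensitive_area(M):.4f}")
```

The estimated area shrinks as $M$ grows, and the remaining sensitive points concentrate where $p_{kj}$ is close to 0 or 1, matching the caption's picture of nested regions near dominant preferences.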

Theorems & Definitions (21)

  • definition 1
  • definition 2
  • definition 3
  • definition 4
  • lemma 1
  • proof
  • lemma 2
  • proof
  • lemma 3
  • proof
  • ...and 11 more