AI Alignment at Your Discretion

Maarten Buyl, Hadi Khalaf, Claudio Mayrink Verdun, Lucas Monteiro Paes, Caio C. Vieira Machado, Flavio du Pin Calmon

TL;DR

The paper investigates alignment discretion, i.e., the latitude annotators have to prioritize competing principles when judging model outputs. It formalizes preference functions and introduces metrics such as Discretion Arbitrariness (DA), Principle Supremacy (PS), the principle priority weights $w^*_c(a)$, and Discretion Discrepancy (DD, a normalized Kendall distance). Using the HH-RLHF and PKU-SafeRLHF datasets with GPT-4o as an oracle, it empirically analyzes both human and algorithmic annotators and finds substantial discretionary latitude, as well as misalignment between human and model discretion. Reward models can mirror human principle prioritization to some extent, but transferring this discretion to language models via RLHF remains challenging, and off-the-shelf models often diverge markedly from human preferences. The work argues for a legal-inspired, auditable framework to document and constrain discretion in AI alignment, emphasizing transparency, oversight, and community governance to prevent arbitrary or opaque alignment decisions.

Abstract

In AI alignment, extensive latitude must be granted to annotators, either human or algorithmic, to judge which model outputs are 'better' or 'safer.' We refer to this latitude as alignment discretion. Such discretion remains largely unexamined, posing two risks: (i) annotators may use their power of discretion arbitrarily, and (ii) models may fail to mimic this discretion. To study this phenomenon, we draw on legal concepts of discretion that structure how decision-making authority is conferred and exercised, particularly in cases where principles conflict or their application is unclear or irrelevant. Extended to AI alignment, discretion is required when alignment principles and rules are (inevitably) conflicting or indecisive. We present a set of metrics to systematically analyze when and how discretion in AI alignment is exercised, such that both risks (i) and (ii) can be observed. Moreover, we distinguish between human and algorithmic discretion and analyze the discrepancy between them. By measuring both human and algorithmic discretion over safety alignment datasets, we reveal layers of discretion in the alignment process that were previously unaccounted for. Furthermore, we demonstrate how algorithms trained on these datasets develop their own forms of discretion in interpreting and applying these principles, which challenges the purpose of having any principles at all. Our paper presents the first step towards formalizing this core gap in current alignment processes, and we call on the community to further scrutinize and control alignment discretion.

Paper Structure

This paper contains 30 sections, 14 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Illustration of how different prioritizations of principles affect which AI model responses are preferred, inspired by the xkcd comic about Asimov's Three Laws of Robotics [xkcd1613]. The user asks for health advice, but an annotator's assessment of how best to respond depends on how they rank three principles: (A) being accurate in responding to medical concerns, (B) referring to experts, and (C) reducing patient anxiety. All three principles are independently desirable, but they allow for discretion in how they are balanced.
  • Figure 2: Illustration of the three principle agreement cases in Def. 6. For each prompt, two candidate responses are evaluated against three principles ('Be helpful', 'Avoid harm', 'Refer to experts'). Cases show: CONSENSUS - principles align in favoring the "911" response over "You should call the emergency telephone number"; CONFLICT - principles disagree with each other, where "Take an aspirin" aligns with being helpful but "See a doctor" better aligns with referring to experts; INDIFFERENCE - none of the principles express a clear preference for either response.
  • Figure 3: Principle agreement frequency (%) according to the three cases distinguished in Def. 6.
  • Figure 4: Principle supremacy matrix for the human annotator of the HH-RLHF dataset. The $(i, j)$ entry indicates the proportion of times that the $i^{\text{th}}$ principle 'wins' over the $j^{\text{th}}$ principle. A win is counted when the two principles conflict and the $i^{\text{th}}$ principle agrees with the human label whereas the $j^{\text{th}}$ principle disagrees with it. We also note the total number of cases of conflict per pair of principles. Empty entries indicate that the pair has never been in conflict. The principles are sorted in descending order of their priority weights, reaffirming that principles with a higher priority weight are more likely to 'win' over a principle with a lower weight.
  • Figure 5: Principle priorities (Def. 9) for each annotator of the HH dataset, excluding the base LLMs. Each plot represents an independent system of principle priorities specific to an annotator, so values are not comparable across subplots. Bars are shaded by principle ranking, with x-axis scales adjusted per annotator to reflect their full range. Red asterisks indicate principles that are never prioritized (i.e., with a weight of negative infinity), while white asterisks indicate principles that are always prioritized (i.e., with a weight of positive infinity). The principles, whose full descriptions are given in Tab. \ref{tab:principles_descriptions}, were interpreted in a broad sense: principles like 'support democracy' can generally refer to a preference for responses that avoid subversion of the government.
  • ...and 9 more figures
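
The captions of Figures 2 and 4 describe two computations that can be sketched in a few lines: sorting a comparison into consensus, conflict, or indifference, and filling a principle supremacy matrix. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It assumes each principle-specific preference is encoded as +1 (prefers the first response), -1 (prefers the second), or 0 (no preference), that annotator labels use the same ±1 encoding, and that each matrix entry is normalized by the number of conflicts between that pair of principles. The function names `classify_case` and `supremacy_matrix` are hypothetical.

```python
from typing import Optional


def classify_case(prefs: list[int]) -> str:
    """Classify one comparison from its principle-specific preferences.

    Assumed encoding: +1 prefers response A, -1 prefers response B,
    0 expresses no preference.
    """
    has_pos = any(p > 0 for p in prefs)
    has_neg = any(p < 0 for p in prefs)
    if has_pos and has_neg:
        return "conflict"       # principles pull in opposite directions
    if has_pos or has_neg:
        return "consensus"      # every opinionated principle agrees
    return "indifference"       # no principle expresses a preference


def supremacy_matrix(
    principle_prefs: list[list[int]], labels: list[int]
) -> list[list[Optional[float]]]:
    """Entry (i, j): share of i-vs-j conflicts in which principle i
    agrees with the annotator label. None marks pairs never in conflict."""
    n = len(principle_prefs[0])
    wins = [[0] * n for _ in range(n)]
    conflicts = [[0] * n for _ in range(n)]
    for prefs, label in zip(principle_prefs, labels):
        for i in range(n):
            for j in range(n):
                # Opposite non-zero preferences mean the pair conflicts.
                if i != j and prefs[i] * prefs[j] < 0:
                    conflicts[i][j] += 1
                    if prefs[i] == label:  # i sides with the label, j does not
                        wins[i][j] += 1
    return [
        [wins[i][j] / conflicts[i][j] if conflicts[i][j] else None
         for j in range(n)]
        for i in range(n)
    ]


# Toy data: four comparisons, three principles, annotator always picks A.
toy_prefs = [[1, -1, 0], [1, 1, 0], [0, 0, 0], [-1, 1, 1]]
toy_labels = [1, 1, 1, 1]
cases = [classify_case(p) for p in toy_prefs]
matrix = supremacy_matrix(toy_prefs, toy_labels)
```

On the toy data, principles 0 and 1 conflict twice and each sides with the label once, so both corresponding entries are 0.5, while principles 1 and 2 never conflict and their entry stays empty.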

Theorems & Definitions (10)

  • Definition 1: preference functions
  • Definition 2: reward model preference functions
  • Definition 3: LLM preference functions
  • Definition 4: principle-specific preferences
  • Definition 5: principle-specific preference functions
  • Definition 6: consensus, conflict, & indifference
  • Definition 7: Discretion Arbitrariness
  • Definition 8: Principle Supremacy
  • Definition 9: Principle Priority
  • Definition 10: Discretion Discrepancy
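
The TL;DR describes Discretion Discrepancy as a normalized Kendall distance. As a rough illustration of that quantity, the sketch below counts the pairs of principles that two rankings order differently and divides by the total number of pairs. It assumes both annotators are summarized as strict rankings over the same set of principles, with no ties; how the paper derives those rankings and treats ties is not stated on this page. The function name and the example principle names are hypothetical.

```python
from itertools import combinations


def normalized_kendall_distance(rank_a: list[str], rank_b: list[str]) -> float:
    """Fraction of item pairs ordered differently by the two rankings.

    0.0 means identical rankings, 1.0 means one is the reverse of the other.
    Both inputs must be strict rankings of the same items (an assumption).
    """
    if len(set(rank_a)) != len(rank_a) or len(rank_a) != len(rank_b):
        raise ValueError("rankings must be tie-free and of equal length")
    if set(rank_a) != set(rank_b):
        raise ValueError("rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        return 0.0  # no pairs to disagree on
    pos_a = {item: idx for idx, item in enumerate(rank_a)}
    pos_b = {item: idx for idx, item in enumerate(rank_b)}
    discordant = sum(
        1
        for x, y in combinations(rank_a, 2)
        # A pair is discordant when the two rankings order it oppositely.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / (n * (n - 1) / 2)


# Hypothetical principle rankings for two annotators.
human = ["avoid harm", "be helpful", "refer to experts"]
model = ["avoid harm", "refer to experts", "be helpful"]
distance = normalized_kendall_distance(human, model)
```

Here the two rankings disagree on one of three pairs, so the distance is 1/3.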