Moral Change or Noise? On Problems of Aligning AI With Temporally Unstable Human Feedback

Vijay Keswani, Cyrus Cousins, Breanna Nguyen, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong

TL;DR

This work investigates how temporally unstable human moral preferences challenge AI alignment in high-stakes domains. Through a longitudinal kidney-allocation experiment with over 400 participants across up to five sessions, the authors quantify two forms of preference drift (response instability and model instability) and demonstrate that both degrade predictive alignment performance over time. They categorize participants into four groups based on stability metrics and show that unstable groups incur substantially higher prediction errors across three modeling paradigms, highlighting the need to distinguish legitimate preference changes from noise. The study discusses normative questions about what to align to when preferences drift and offers technical paths toward dynamic, robust alignment, including richer data collection and interactive, evaluative feedback beyond single-session choice data.

Abstract

Alignment methods in moral domains seek to elicit moral preferences of human stakeholders and incorporate them into AI. This presupposes moral preferences as static targets, but such preferences often evolve over time. Proper alignment of AI to dynamic human preferences should ideally account for "legitimate" changes to moral reasoning, while ignoring changes related to attention deficits, cognitive biases, or other arbitrary factors. However, common AI alignment approaches largely neglect temporal changes in preferences, posing serious challenges to proper alignment, especially in high-stakes applications of AI, e.g., in healthcare domains, where misalignment can jeopardize the trustworthiness of the system and yield serious individual and societal harms. This work investigates the extent to which people's moral preferences change over time, and the impact of such changes on AI alignment. Our study is grounded in the kidney allocation domain, where we elicit responses to pairwise comparisons of hypothetical kidney transplant patients from over 400 participants across 3-5 sessions. We find that, on average, participants change their response to the same scenario presented at different times around 6-20% of the time (exhibiting "response instability"). Additionally, we observe significant shifts in several participants' retrofitted decision-making models over time (capturing "model instability"). The predictive performance of simple AI models decreases as a function of both response and model instability. Moreover, predictive performance diminishes over time, highlighting the importance of accounting for temporal changes in preferences during training. These findings raise fundamental normative and technical challenges relevant to AI alignment, highlighting the need to better understand the object of alignment (what to align to) when user preferences change significantly over time.
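As a rough illustration of the "response instability" notion described above, the following Python sketch computes how often a participant's answer to the same repeated scenario differs between two presentations. The pairwise-disagreement definition, the function name `response_instability`, the choice labels `"A"`/`"B"`, and the example data are all assumptions made for illustration; the paper's exact metric may differ.

```python
from itertools import combinations


def response_instability(responses):
    """Average pairwise disagreement over repeated scenarios.

    `responses` maps a scenario id to the list of choices one participant
    gave each time that scenario was presented (hypothetical data layout).
    This pairwise definition is an assumption, not the paper's formula.
    """
    rates = []
    for choices in responses.values():
        pairs = list(combinations(choices, 2))
        if not pairs:
            # A scenario shown only once carries no stability information.
            continue
        changed = sum(a != b for a, b in pairs)
        rates.append(changed / len(pairs))
    # Average over scenarios; 0.0 if nothing was repeated.
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical participant: stable on U1, one flipped answer on U2.
example = {
    "U1": ["A", "A", "A"],
    "U2": ["A", "B", "A"],
}
print(response_instability(example))
```

Here the participant never changes their answer on `U1` but disagrees with themselves in two of the three presentation pairs on `U2`, so the averaged score is one third. Averaging such scores across participants would give a population-level figure comparable in spirit to the 6-20% range reported above.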

Paper Structure

This paper contains 47 sections, 6 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Response stability distribution (median annotated) for all repeated scenarios. Participants were relatively more stable for $U_1, U_2$ compared to other scenarios.
  • Figure 2: Model stability between sessions, $\textrm{MS}(\cdot, \cdot)$, vs. time difference between sessions. A significant negative correlation indicates that model stability decreases with time.
  • Figure 3: Average response stability vs. model stability, with participants categorized based on their plot location.
  • Figure 4: Session-wise model entropy and model shift for all categories. Participant categories differ in how their model properties change over time, revealing change mechanisms.
  • Figure 5: Error rate boxplot of all models, showing significant disparities in performance across participant categories.
  • ...and 8 more figures