Moral Change or Noise? On Problems of Aligning AI With Temporally Unstable Human Feedback
Vijay Keswani, Cyrus Cousins, Breanna Nguyen, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong
TL;DR
This work investigates how temporally unstable human moral preferences challenge AI alignment in high-stakes domains. Through a longitudinal kidney-allocation experiment with over 400 participants across three to five sessions, the authors quantify two forms of preference drift (response instability and model instability) and demonstrate that both degrade predictive alignment performance over time. They categorize participants into four groups based on stability metrics and show that unstable groups incur substantially higher prediction errors across three modeling paradigms, highlighting the need to distinguish legitimate preference changes from noise. The study discusses normative questions about what to align to when preferences drift and offers technical paths toward dynamic, robust alignment, including richer data collection and interactive, evaluative feedback beyond single-session choice data.
Abstract
Alignment methods in moral domains seek to elicit moral preferences of human stakeholders and incorporate them into AI. This presupposes moral preferences as static targets, but such preferences often evolve over time. Proper alignment of AI to dynamic human preferences should ideally account for "legitimate" changes to moral reasoning, while ignoring changes related to attention deficits, cognitive biases, or other arbitrary factors. However, common AI alignment approaches largely neglect temporal changes in preferences, posing serious challenges to proper alignment, especially in high-stakes applications of AI, e.g., in healthcare domains, where misalignment can jeopardize the trustworthiness of the system and yield serious individual and societal harms. This work investigates the extent to which people's moral preferences change over time, and the impact of such changes on AI alignment. Our study is grounded in the kidney allocation domain, where we elicit responses to pairwise comparisons of hypothetical kidney transplant patients from over 400 participants across 3-5 sessions. We find that, on average, participants change their response to the same scenario presented at different times around 6-20% of the time (exhibiting "response instability"). Additionally, we observe significant shifts in several participants' retrofitted decision-making models over time (capturing "model instability"). The predictive performance of simple AI models decreases as a function of both response and model instability. Moreover, predictive performance diminishes over time, highlighting the importance of accounting for temporal changes in preferences during training. These findings raise fundamental normative and technical challenges relevant to AI alignment, highlighting the need to better understand the object of alignment (what to align to) when user preferences change significantly over time.
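To make the notion of "response instability" concrete, the following is a minimal sketch of one way such a rate could be computed for a single participant. It is an illustration under stated assumptions, not the authors' implementation: the function name `response_instability`, the `(scenario_id, session, choice)` record layout, and the pairwise-comparison definition are all hypothetical, and the paper's exact metric may differ.

```python
from collections import defaultdict
from itertools import combinations


def response_instability(responses):
    """Fraction of repeated-scenario comparisons in which the choice changed.

    `responses` is an iterable of (scenario_id, session, choice) tuples for
    one participant. This record layout is an assumption for illustration.
    Returns None when no scenario was seen in more than one session.
    """
    # Group every answer by the scenario it was given for.
    by_scenario = defaultdict(list)
    for scenario_id, session, choice in responses:
        by_scenario[scenario_id].append((session, choice))

    changed = 0
    compared = 0
    for answers in by_scenario.values():
        # Compare each pair of answers to the same scenario.
        for (s1, c1), (s2, c2) in combinations(answers, 2):
            if s1 == s2:
                continue  # only count answers given in different sessions
            compared += 1
            if c1 != c2:
                changed += 1

    return changed / compared if compared else None


# Toy data (fabricated purely for illustration, not study data).
toy = [
    ("A", 1, "left"), ("A", 2, "left"), ("A", 3, "right"),
    ("B", 1, "right"), ("B", 2, "right"),
    ("C", 1, "left"),  # seen once, so it contributes no comparison
]
rate = response_instability(toy)
print(rate)
```

Counting every cross-session pair is only one possible design choice; comparing consecutive sessions only, or comparing each later answer against the first, would give different rates for the same data.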
