On The Stability of Moral Preferences: A Problem with Computational Elicitation Methods

Kyle Boerstler; Vijay Keswani; Lok Chan; Jana Schaich Borg; Vincent Conitzer; Hoda Heidari; Walter Sinnott-Armstrong

TL;DR

This study interrogates the assumption that moral preferences elicited in AI contexts are stable over time. By repeating kidney-allocation scenarios across ten sessions in two studies, the authors quantify within- and between-participant stability, revealing substantial instability (especially in controversial cases) that is associated with decision difficulty and response time. They use Bradley-Terry modeling to extract individual feature weights and compare session-specific predictive models (linear and non-linear) against a baseline fixed policy, finding that models drift across sessions and that aggregation can mask instability. The findings raise concerns for training ethical AI systems and suggest that repeated elicitation or boundary-aware approaches may be necessary to capture stakeholder values reliably and avoid misalignment. Overall, the paper highlights methodological vulnerabilities in single-shot moral preference elicitation and outlines implications for designing robust, stakeholder-aligned ethical AI systems.
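As a rough illustration of what extracting feature weights with a Bradley-Terry model can look like, the sketch below fits one common feature-based form, in which the probability of choosing patient A over patient B is a logistic function of a weighted feature difference. The parameterization, the fitting procedure (plain gradient ascent), the function name `fit_bradley_terry`, and the two patient features are all assumptions made for illustration; the paper's own specification is not shown in this section.

```python
import math

def fit_bradley_terry(pairs, n_features, lr=0.1, epochs=500):
    """Fit weights w so that P(chosen over rejected) = sigmoid(w . (x_chosen - x_rejected)).

    pairs: list of (x_chosen, x_rejected) feature vectors from one participant.
    Uses plain gradient ascent on the mean log-likelihood (an illustrative choice).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            z = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-z))  # model probability of the observed choice
            for j, d in enumerate(diff):
                grad[j] += (1.0 - p) * d    # gradient of log sigmoid(z) w.r.t. w[j]
        w = [wi + lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w

# Hypothetical pairwise choices; features are [life-years gained / 10, dependents].
example_pairs = [
    ([2.0, 1.0], [1.0, 0.0]),
    ([1.5, 0.0], [0.5, 2.0]),
    ([1.0, 2.0], [1.0, 0.0]),
    ([0.5, 1.0], [1.5, 1.0]),
]
weights = fit_bradley_terry(example_pairs, n_features=2)
print(weights)
```

Fitting such a model separately per session, as the summary describes, would allow the learned weights to be compared across sessions to look for drift.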

Abstract

Preference elicitation frameworks feature heavily in the research on participatory ethical AI tools and provide a viable mechanism to enquire about and incorporate the moral values of various stakeholders. As part of the elicitation process, surveys about moral preferences, opinions, and judgments are typically administered only once to each participant. This methodological practice is reasonable if participants' responses are stable over time such that, all other relevant factors being held constant, their responses today will be the same as their responses to the same questions at a later time. However, we do not know how often that is the case. It is possible that participants' true moral preferences change, are subject to temporary moods or whims, or are influenced by environmental factors we don't track. If participants' moral responses are unstable in such ways, it would raise important methodological and theoretical issues for how participants' true moral preferences, opinions, and judgments can be ascertained. We address this possibility here by asking the same survey participants the same moral questions, about which patient should receive a kidney when only one is available, ten times in ten different sessions over two weeks, varying only presentation order across sessions. We measured how often participants gave different responses to simple (Study One) and more complicated (Study Two) repeated scenarios. On average, the fraction of times participants changed their responses to controversial scenarios was around 10-18% across studies, and this instability was positively associated with response time and decision-making difficulty. We discuss the implications of these results for the efficacy of moral preference elicitation, highlighting the role of response instability in causing value misalignment between stakeholders and AI tools trained on their moral judgments.
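To make the quantity "fraction of times participants changed their responses" concrete, here is a minimal sketch under one possible reading: instability is the share of sessions in which a participant's answer to a repeated scenario differs from that participant's most frequent (modal) answer. This definition, the function name `instability`, and the ten-session data are assumptions for illustration and may differ from the measure defined in the paper's Methods.

```python
from collections import Counter

def instability(responses):
    """Share of sessions whose response differs from the modal response.

    responses: one participant's answers to the same scenario, one per session.
    NOTE: an assumed definition for illustration, not necessarily the paper's.
    """
    if not responses:
        raise ValueError("need at least one response")
    modal_count = Counter(responses).most_common(1)[0][1]
    return (len(responses) - modal_count) / len(responses)

# Hypothetical data: one participant, one scenario, ten sessions.
sessions = ["A", "A", "B", "A", "A", "A", "B", "A", "A", "A"]
rate = instability(sessions)
print(rate)  # 2 of 10 responses differ from the modal answer "A"
```

Averaging this rate over participants and scenarios would give a study-level figure comparable in kind to the 10-18% range reported above.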

Paper Structure (35 sections, 3 equations, 7 figures, 10 tables)

Introduction
Our Contributions.
Related Work.
Study
Methods
Participants.
Study Design.
Experimental Procedure.
Statistics and Analysis
Outlier Removal.
Response stability.
Between-participant response agreement.
Modeling.
Statistical Significance Tests.
Results
...and 20 more sections

Figures (7)

Figure 1: Example of the scenario interface participants responded to in Study One.
Figure 2: Scatter plots of response stability vs priority score difference for all six repeated scenarios in Study Two. Plot titles provide the Spearman correlation coefficient values; coefficients with the "**" mark indicate that the value is statistically significant at $p<0.01$. For the uncontroversial scenarios (S2U1-S2U3), most participants were perfectly stable. For the controversial scenarios (S2C1-S2C3), best-fit lines show a significant positive association between stability and priority score difference.
Figure 3: Distribution of average stability levels across participants for the two studies.
Figure 4: Scatter plot of mean reaction time vs response stability for controversial repeated scenarios in Study One and Study Two. The plots also present the best-fit lines and Pearson correlation coefficient between these two variables (** indicates that the correlation was statistically significant at $p<0.05$).
Figure 5: Summary statistics for feature weights learned from the Bradley-Terry model for each participant in Study One.
...and 2 more figures
