From Stability to Inconsistency: A Study of Moral Preferences in LLMs

Monika Jotautaite, Mary Phuong, Chatrik Singh Mangat, Maria Angelica Martinez

TL;DR

This work probes how large language models encode moral values by grounding the analysis in Moral Foundations Theory and introducing MFD-LLM, a dataset of 1079 real-world dilemma scenarios, each paired with six candidate actions, one per moral foundation. A novel multi-preference evaluation tracks the full spectrum of revealed moral preferences by sampling across rephrasings and four decision modes, addressing the limitations of single-framing assessments. Across the GPT, Claude, Llama, and Gemini families, the study finds a striking moral homogeneity aligned with Care and Fairness, but also a notable lack of consistency when scenarios are framed differently, suggesting robustness gaps and Western-leaning priors in the training data. The dataset and methodology provide a scalable tool for tracking moral value alignment in evolving LLMs and for informing more diverse, globally aware AI value systems.

Abstract

As large language models (LLMs) become increasingly integrated into our daily lives, it is crucial to understand their implicit biases and moral tendencies. To address this, we introduce a Moral Foundations LLM dataset (MFD-LLM) grounded in Moral Foundations Theory, which conceptualizes human morality through six core foundations. We propose a novel evaluation method that captures the full spectrum of LLMs' revealed moral preferences by having the models answer a range of real-world moral dilemmas. Our findings reveal that state-of-the-art models have remarkably homogeneous value preferences, yet their answers lack consistency.

Paper Structure

This paper contains 24 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Evaluation methodology: each scenario is posed to the LLM in 4 different ways to capture model preferences between different moral foundations.
  • Figure 2: Total preferences: How often each LLM (coloured lines) answered in line with each moral foundation (row). Each model was prompted to choose from six possible actions, each corresponding to a different moral foundation. The dots on each row represent the percentage of times a model's chosen action matched the respective moral foundation. Error bars reflect variability derived from bootstrapping.
  • Figure 3: Dataset clustered with k-means and visualised using t-SNE.
  • Figure 4: Single preference evaluation: How often each LLM (coloured lines) answered in line with each moral foundation (row). For each scenario, the model was given a binary choice between performing an action that aligns with a moral foundation or not. The dots on each row represent how often the model chose to perform an action that aligns with the corresponding moral foundation. Error bars reflect variability derived from bootstrapping.
  • Figure 5: Pair (left) and triple (right) preferences for GPT-3.5. The arrows point towards the more preferred moral foundations for the model, and the thickness of the lines indicates how strongly one foundation is preferred over the other. Triple preferences are aggregated over all triplets of moral foundations and condensed into preference edges.
  • ...and 3 more figures
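
The captions for Figures 2 and 4 describe per-foundation percentages with bootstrapped error bars. The following is a minimal sketch of how such numbers could be computed, not the authors' code. It assumes the six foundation labels of Moral Foundations Theory (the source names only Care and Fairness), a flat list of chosen-action labels as input, and a percentile bootstrap; the function names `preference_shares` and `bootstrap_intervals` are hypothetical.

```python
import random
from collections import Counter

# Assumed labels: the six foundations of Moral Foundations Theory.
# The source text names only Care and Fairness explicitly.
FOUNDATIONS = ["Care", "Fairness", "Loyalty", "Authority", "Sanctity", "Liberty"]


def preference_shares(choices):
    """Fraction of answers that matched each moral foundation."""
    counts = Counter(choices)
    total = len(choices)
    return {f: counts[f] / total for f in FOUNDATIONS}


def bootstrap_intervals(choices, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for each foundation's share.

    Resamples the list of choices with replacement and recomputes
    the shares on every resample.
    """
    rng = random.Random(seed)
    total = len(choices)
    samples = {f: [] for f in FOUNDATIONS}
    for _ in range(n_resamples):
        resample = rng.choices(choices, k=total)
        shares = preference_shares(resample)
        for f in FOUNDATIONS:
            samples[f].append(shares[f])

    # Indices of the lower and upper percentiles, clamped to a valid range.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    intervals = {}
    for f in FOUNDATIONS:
        ordered = sorted(samples[f])
        intervals[f] = (ordered[lo_idx], ordered[hi_idx])
    return intervals


# Toy input (made-up labels, not data from the paper).
toy_choices = ["Care"] * 60 + ["Fairness"] * 30 + ["Liberty"] * 10
toy_shares = preference_shares(toy_choices)
toy_intervals = bootstrap_intervals(toy_choices)
```

On the toy input, `toy_shares["Care"]` is 0.6, and each entry of `toy_intervals` is a `(lower, upper)` pair that could be drawn as an error bar around the corresponding share. The paper may resample at a different level (for example per scenario rather than per answer); this sketch resamples individual answers.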