MOSAIC: Unveiling the Moral, Social and Individual Dimensions of Large Language Models

Erica Coppolillo, Emilio Ferrara

TL;DR

MOSAIC is the first large-scale benchmark designed to jointly assess the moral, social, and individual characteristics of LLMs; it provides the first empirical evidence that MFT alone is insufficient to comprehensively evaluate complex AI systems' ethical behavior.

Abstract

Large Language Models (LLMs) are increasingly deployed in sensitive applications including psychological support, healthcare, and high-stakes decision-making. This expansion has motivated growing research into the ethical and moral foundations underlying LLM behavior, raising critical questions about their reliability in ethical reasoning. However, existing studies and benchmarks rely almost exclusively on Moral Foundation Theory (MFT), largely neglecting other relevant dimensions such as social values, personality traits, and individual characteristics that shape human ethical reasoning. To address these limitations, we introduce MOSAIC, the first large-scale benchmark designed to jointly assess the moral, social, and individual characteristics of LLMs. The benchmark comprises nine validated questionnaires drawn from moral philosophy, psychology, and social theory, alongside four platform-based games designed to probe morally ambiguous scenarios. In total, MOSAIC includes over 600 curated questions and scenarios, released as a ready-to-use, extensible resource for evaluating the behavioral foundations of LLMs. We validate the benchmark across three models from different families, demonstrating its utility across all assessed dimensions and providing the first empirical evidence that MFT alone is insufficient to comprehensively evaluate complex AI systems' ethical behavior. We publicly release the dataset and our benchmark Python library.

Paper Structure (15 sections, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Comparison between the correlation patterns observed in human populations (a) and in the tested LLMs via MOSAIC (b). $^{***}p < .001$, $^{**} p <.01$, $^*p<.05$. Blank cells indicate correlations without statistical significance ($p \geq .05$). Green (resp. red) denotes positive (resp. negative) correlation, with darker color indicating stronger association.
  • Figure 2: Scores obtained by the models on the MFQ-2 questionnaire. Higher values indicate a stronger propensity toward the corresponding moral foundation.
  • Figure 3: Model scores on the LSRP test, categorized by primary and secondary psychopathy. Error bars represent standard deviations, while background colors denote increasing risk of psychopathy according to Levenson et al. (1995).
  • Figure 4: Results on the Social Dominance Orientation (SDO) questionnaire, divided into the Anti-Egalitarianism and Dominance dimensions. Error bars indicate standard deviations. Darker background color indicates a stronger preference for social dominance.
  • Figure 5: Scores obtained by the tested LLMs in terms of the Individualism-Collectivism Scale (ICS). X-axes represent Horizontal (top) and Vertical (bottom) dimensions, while Y-axes reflect Collectivism (left) and Individualism (right). Error bars on markers indicate standard deviations.
  • ...and 9 more figures