Language Model Alignment in Multilingual Trolley Problems

Zhijing Jin; Max Kleiman-Weiner; Giorgio Piatti; Sydney Levine; Jiarui Liu; Fernando Gonzalez; Francesco Ortu; András Strausz; Mrinmaya Sachan; Rada Mihalcea; Yejin Choi; Bernhard Schölkopf

TL;DR

This work introduces MultiTP, a multilingual, parametric trolley-problem dataset derived from the Moral Machine experiment, and uses it to evaluate how well 19 LLMs align with diverse human moral judgments across 107 languages and six moral dimensions. Alignment is quantified by a global misalignment score (MIS), defined as the $L_2$ distance between the human and model preference vectors and computed with language-weighted country mappings. Only a few models approach human-like alignment and most show notable misalignment, although there is little evidence that low-resource languages are systematically disadvantaged. The study also finds dimension-specific biases (notably in gender, age, and fitness) and substantial language sensitivity; results are robust to prompt paraphrasing, and jailbreaking modestly reduces refusals. Overall, the findings stress the importance of multilingual, culturally inclusive evaluation for responsible AI ethics and pave the way for pluralistic alignment research.
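As a rough illustration of the metric described above, the sketch below computes an $L_2$ distance between a human and a model preference vector over the six moral dimensions. The function name `misalignment_score`, the dictionary layout, and all numeric values are assumptions made for this example; they are not taken from the paper or its released code.

```python
import math

# The six moral dimensions named in the summary.
DIMENSIONS = ["species", "gender", "fitness", "status", "age", "number"]


def misalignment_score(human, model):
    """L2 distance between two preference vectors keyed by dimension.

    Hypothetical helper: each vector maps a dimension name to a
    preference value (assumed here to lie in [0, 1]).
    """
    return math.sqrt(sum((human[d] - model[d]) ** 2 for d in DIMENSIONS))


# Illustrative values only, not results from the study.
human_prefs = {"species": 0.9, "gender": 0.5, "fitness": 0.6,
               "status": 0.6, "age": 0.7, "number": 0.8}
model_prefs = {"species": 0.9, "gender": 0.8, "fitness": 0.6,
               "status": 0.6, "age": 0.3, "number": 0.8}

score = misalignment_score(human_prefs, model_prefs)
```

A score of zero would mean the two vectors coincide on every dimension; larger values indicate larger overall misalignment under this sketch.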

Abstract

We evaluate the moral alignment of LLMs with human preferences in multilingual trolley problems. Building on the Moral Machine experiment, which captures over 40 million human judgments across 200+ countries, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. This dataset enables the assessment of LLMs' decision-making processes in diverse linguistic contexts. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions: species, gender, fitness, status, age, and the number of lives involved. By correlating these preferences with the demographic distribution of language speakers and examining the consistency of LLM responses to various prompt paraphrasings, our findings provide insights into cross-lingual and ethical biases of LLMs and their intersection. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems and highlighting the importance of incorporating diverse perspectives in AI ethics. The results underscore the need for further research on the integration of multilingual dimensions in responsible AI research to ensure fair and equitable AI interactions worldwide. Our code and data are at https://github.com/causalNLP/moralmachine
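One way to relate country-level human preferences to a language is to weight each country by how many speakers of that language it has. The sketch below assumes a simple speaker-count weighted average; the exact weighting scheme is not specified in this summary, and the function name, input layout, and numbers are hypothetical.

```python
def language_weighted_preferences(country_prefs, speaker_counts):
    """Average country-level preference vectors, weighted by speaker counts.

    Assumed inputs (hypothetical layout):
      country_prefs:  {country: {dimension: preference value}}
      speaker_counts: {country: number of speakers of the language}
    """
    total = sum(speaker_counts.values())
    if total <= 0:
        raise ValueError("speaker counts must sum to a positive number")
    dimensions = next(iter(country_prefs.values())).keys()
    return {
        dim: sum(speaker_counts[c] * country_prefs[c][dim]
                 for c in speaker_counts) / total
        for dim in dimensions
    }


# Illustrative values only: one language spoken in two countries.
example_country_prefs = {
    "country_a": {"age": 0.8, "gender": 0.6},
    "country_b": {"age": 0.4, "gender": 0.2},
}
example_speakers = {"country_a": 3_000_000, "country_b": 1_000_000}

weighted = language_weighted_preferences(example_country_prefs, example_speakers)
```

The resulting per-language human vector could then be compared with a model's preference vector for the same language.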

Paper Structure (53 sections, 1 equation, 24 figures, 14 tables)

Introduction
Related Work
MultiTP: Evaluating LLMs in Multilingual Trolley Problems
Trolley Problem Setup
The Original Human Study
Vignette Template
Systematic Variations
Prompt Construction in Multiple Languages
Setup for LLM Testing
Multilingual Variation
Dataset Statistics
Evaluation Design
Model Selection
Preference Assessment
Misalignment Metric
...and 38 more sections

Figures (24)

Figure 1: An example scenario in the MultiTP dataset. Each question is presented in 107 different languages. Here, we select three languages, German, Chinese, and Swahili, and show the responses of LLMs (English translations provided for readers).
Figure 2: Model alignment on trolley problems with human preferences.
Figure 3: Radar plots of the preference decomposition across the six different moral dimensions. Llama 3.1 70B aligns well on most dimensions except for gender. However, GPT-4o Mini lacks diversity and tends to binarize on each dimension, which results in larger misalignment.
Figure 3: Language sensitivity scores for all models. Higher scores mean more varied misalignment across languages. We highlight the models with the most and least language sensitivity scores.
Figure 4: Distribution of preferences by each moral dimension across languages, using the most-aligned model Llama 3.1 70B. The dashed line is the overall human preference on each dimension. Cluster A: Georgian, Filipino, Maltese, etc. B: German, Italian, Ukrainian, etc. C: English, Finnish, Chinese, etc. D: Hungarian, Kazakh, Uyghur, etc. See \ref{app:llm_sensitive_languages} for the entire list of languages in each cluster, as well as clustering results for other models such as GPT-3 and GPT-4.
...and 19 more figures
