Language Model Alignment in Multilingual Trolley Problems
Zhijing Jin, Max Kleiman-Weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf
TL;DR
This work introduces MultiTP, a multilingual, parametric trolley-problem dataset derived from the Moral Machine experiment, to evaluate how well 19 LLMs align with diverse human moral judgments across 107 languages and six moral dimensions. Alignment is quantified by a global misalignment score (MIS): the $L_2$ distance between human and model preference vectors, computed with language-weighted country mappings. Across analyses, only a few models approach human-like alignment and most exhibit notable misalignment, though there is little evidence that low-resource languages are systematically disadvantaged. The study further reveals dimension-specific biases (notably in gender, age, and fitness), substantial sensitivity to the prompt language, and robustness of results to prompt paraphrasing, while jailbreaking only modestly reduces refusals. Overall, the findings stress the importance of multilingual, culturally inclusive evaluation for responsible AI ethics and pave the way for pluralistic alignment research.
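As a rough illustration of the score described above, the following Python sketch computes an $L_2$ distance between a human and a model preference vector over the six dimensions. The function names, the speaker-share weighting scheme, and all numbers are assumptions made for illustration; they are not taken from the MultiTP code base.

```python
import math

# The six moral dimensions named in the paper; the ordering here is arbitrary.
DIMENSIONS = ["species", "gender", "fitness", "status", "age", "number"]


def language_weighted_human_vector(country_prefs, speaker_share):
    """Average country-level preference vectors for one language.

    country_prefs: dict mapping country -> list of six preference values.
    speaker_share: dict mapping country -> weight (hypothetical share of the
                   language's speakers living in that country).
    This weighting scheme is an assumption, not the paper's exact procedure.
    """
    total = sum(speaker_share.values())
    return [
        sum(speaker_share[c] * country_prefs[c][i] for c in speaker_share) / total
        for i in range(len(DIMENSIONS))
    ]


def misalignment(human, model):
    """L2 (Euclidean) distance between two preference vectors."""
    return math.sqrt(sum((h - m) ** 2 for h, m in zip(human, model)))


# Made-up values, for illustration only.
country_prefs = {
    "A": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "B": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
speaker_share = {"A": 3, "B": 1}

human = language_weighted_human_vector(country_prefs, speaker_share)
model = [0.75, 0.0, 0.0, 0.0, 0.0, 0.0]
score = misalignment(human, model)
```

A score of zero means the model's preference vector matches the weighted human vector exactly; larger values indicate greater misalignment.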
Abstract
We evaluate the moral alignment of LLMs with human preferences in multilingual trolley problems. Building on the Moral Machine experiment, which captures over 40 million human judgments across 200+ countries, we develop MultiTP, a cross-lingual corpus of moral dilemma vignettes in over 100 languages. This dataset enables the assessment of LLMs' decision-making processes in diverse linguistic contexts. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions: species, gender, fitness, status, age, and the number of lives involved. By correlating these preferences with the demographic distribution of language speakers and examining the consistency of LLM responses to various prompt paraphrasings, our findings provide insights into cross-lingual and ethical biases of LLMs and their intersection. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems and highlighting the importance of incorporating diverse perspectives in AI ethics. The results underscore the need for further research on the integration of multilingual dimensions in responsible AI research to ensure fair and equitable AI interactions worldwide. Our code and data are at https://github.com/causalNLP/moralmachine.
