Measuring Moral LLM Responses in Multilingual Capacities

Kimaya Basu; Savi Kolari; Allison Yu

TL;DR

The paper addresses the challenge of evaluating moral, ethical, and safety-related responses of multilingual LLMs by constructing a dataset across five categories and six languages, and by using a 5-point rubric evaluated with a judge LLM. It benchmarks frontier models (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro) and open-source models (Llama 4 Scout, Qwen3 235B-a22b) through a translation-based pipeline, revealing significant cross-language inconsistencies, especially in Harm Prevention & Safety and Consent & Autonomy. GPT-5 generally outperforms peers, while Gemini 2.5 Pro shows notable weaknesses in trick-question contexts, underscoring safety-testing gaps across languages. The work highlights the need for more robust multilingual datasets and evaluation frameworks to ensure safe, reliable LLM behavior worldwide, and proposes directions for broader language coverage and human-baseline comparisons.

Abstract

As LLM usage spreads across countries and languages, the need to understand and guardrail multilingual responses grows. Large-scale datasets for testing and benchmarking have been created to evaluate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low- and high-resource languages to measure LLM accuracy and consistency in multilingual contexts. We score the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models were less consistent across languages and categories. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT-5 scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing of how linguistic shifts affect LLM responses across categories, and for improvement in these areas.
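
The per-category averages and cross-language consistency described above could be computed from judge scores roughly as follows. This is a minimal sketch, not the authors' pipeline: the record layout, the model names, the function names `category_averages` and `language_spread`, and every score in `records` are hypothetical placeholders, and the max-minus-min spread is only one possible way to quantify inconsistency.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge outputs: (model, category, language, rubric score).
# Placeholder values for illustration only; these are NOT results from the paper.
records = [
    ("model_a", "Harm Prevention & Safety", "en", 5),
    ("model_a", "Harm Prevention & Safety", "hi", 4),
    ("model_a", "Consent & Autonomy", "en", 4),
    ("model_a", "Consent & Autonomy", "hi", 2),
    ("model_b", "Harm Prevention & Safety", "en", 3),
    ("model_b", "Harm Prevention & Safety", "hi", 1),
]


def category_averages(rows):
    """Mean rubric score per (model, category), pooled over all languages."""
    buckets = defaultdict(list)
    for model, category, _language, score in rows:
        buckets[(model, category)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


def language_spread(rows):
    """Gap between the best and worst per-language mean for each (model, category).

    A larger gap means the model behaves less consistently across languages.
    """
    per_language = defaultdict(lambda: defaultdict(list))
    for model, category, language, score in rows:
        per_language[(model, category)][language].append(score)
    spread = {}
    for key, by_language in per_language.items():
        language_means = [mean(scores) for scores in by_language.values()]
        spread[key] = max(language_means) - min(language_means)
    return spread


if __name__ == "__main__":
    for key, avg in sorted(category_averages(records).items()):
        print(key, "average:", avg)
    for key, gap in sorted(language_spread(records).items()):
        print(key, "cross-language spread:", gap)
```

Pooling over languages gives the kind of per-category average reported in the abstract, while the spread isolates how much a score depends on the prompt language.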
