EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
Hadi Mohammadi, Anastasia Giachanou, Ayoub Bagheri
TL;DR
EvalMORAAL introduces a transparent framework for evaluating the moral alignment of LLMs across cultures. It combines a two-score system (a log-probability score and a direct CoT score), a structured five-sample chain-of-thought protocol, and a model-as-judge peer-review mechanism. Applied to 20 models over 64 countries and 23 topics using the WVS and PEW benchmarks, the framework shows that top models align closely with survey responses (WVS $r \approx 0.90$, PEW $r \approx 0.86$), but it also reveals a substantial gap between Western and non-Western regions ($r \approx 0.82$ vs. $0.61$). Direct scoring consistently outperforms log-probability scoring ($\Delta r \approx 0.098$), and the LLM-as-judge system provides scalable quality signals, with Fleiss’ $\kappa = 0.67$ and peer agreement that correlates with survey alignment. Violence-related topics remain the hardest, and region-specific safeguards are needed for equitable deployment. The work offers benchmarks, reproducible prompts, and practical guidance for building culturally aware AI systems, while highlighting open challenges in cross-cultural moral reasoning and data representativeness.
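For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of the standard Fleiss' $\kappa$ computation. It is not taken from the paper: the function name `fleiss_kappa`, the layout of the `counts` table, and the toy ratings are all assumptions made for illustration, and the summary does not say how the judges' ratings were categorized.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Assumes every item was rated by the same number of raters (n >= 2).
    """
    num_items = len(counts)
    if num_items == 0:
        raise ValueError("need at least one rated item")
    n = sum(counts[0])  # raters per item
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    num_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (num_items * n)
        for j in range(num_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n) / (n * (n - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / num_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical toy table: 4 items, 3 judges, 2 categories.
    toy = [[3, 0], [0, 3], [2, 1], [3, 0]]
    print(round(fleiss_kappa(toy), 3))
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance.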
Abstract
We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (WVS; 55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's r approximately 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating consistent regional bias. Our framework contributes three components: (1) two scoring methods applied to all models to enable fair comparison, (2) a structured chain-of-thought protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement correlates with survey alignment (WVS r=0.74, PEW r=0.39, both p<.001), supporting its use for automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for deployment across regions.
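The alignment figures above are Pearson correlations between model scores and survey scores. The sketch below shows one way such a correlation and a Western vs. non-Western gap could be computed. It is an illustration under stated assumptions, not the authors' pipeline: the helper names `pearson_r` and `regional_gap`, the record layout, the pooling of all pairs within a region, and the toy numbers are hypothetical, and the abstract does not specify how regional averages were formed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def regional_gap(records, western):
    """Return (r_western, r_other, gap) for (country, survey, model) records.

    `western` is a set of country names treated as Western (an assumption;
    the grouping used in the paper is not given in the abstract).
    """
    west = [(s, m) for c, s, m in records if c in western]
    other = [(s, m) for c, s, m in records if c not in western]
    r_west = pearson_r([s for s, _ in west], [m for _, m in west])
    r_other = pearson_r([s for s, _ in other], [m for _, m in other])
    return r_west, r_other, r_west - r_other


if __name__ == "__main__":
    # Hypothetical toy data: one (country, survey score, model score)
    # record per country-topic pair.
    toy = [
        ("A", -1.0, -0.8), ("A", 0.0, 0.1), ("A", 1.0, 0.9),
        ("B", -1.0, -0.2), ("B", 0.0, 0.4), ("B", 1.0, 0.1),
    ]
    print(regional_gap(toy, western={"A"}))
```

With the reported regional values, the same subtraction gives 0.82 - 0.61 = 0.21, the absolute gap quoted in the abstract.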
