EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Hadi Mohammadi, Anastasia Giachanou, Ayoub Bagheri

TL;DR

EvalMORAAL introduces a transparent framework for evaluating moral alignment in LLMs across cultures. It combines a two-score system (log-probability and direct CoT score), a structured five-sample chain-of-thought protocol, and a model-as-judge peer-review mechanism. Applied to 20 models over 64 countries and 23 topics using the WVS and PEW benchmarks, the framework shows that top models align closely with survey responses (WVS $r \approx 0.90$, PEW $r \approx 0.86$), but it also reveals a substantial gap between Western and non-Western regions ($r \approx 0.82$ vs. $0.61$). Direct scoring consistently outperforms log-probability scoring ($\Delta r \approx 0.098$), and the model-as-judge system provides scalable quality signals (Fleiss’ $\kappa = 0.67$), with peer agreement that correlates with survey alignment. Despite this progress, violence-related topics remain the hardest, and region-specific safeguards are necessary for equitable deployment. The work offers benchmarks, reproducible prompts, and practical guidance for building culturally aware AI systems, while highlighting open challenges in cross-cultural moral reasoning and data representativeness.

Abstract

We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's r approximately 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating consistent regional bias. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured chain-of-thought protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39, both p<.001), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
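The alignment figures quoted above are Pearson correlations between model scores and survey responses. The following is a minimal sketch of that computation, not the authors' code: the `pearson_r` helper, the `(country, topic)` keys, and every toy score are hypothetical illustrations. The last lines only reproduce the arithmetic of the reported regional gap.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one score per (country, topic) pair.
# Real inputs would come from the WVS or PEW survey and from a model's scores.
survey_scores = {
    ("NL", "divorce"): 0.55,
    ("NL", "political violence"): -0.90,
    ("EG", "divorce"): -0.30,
    ("EG", "political violence"): -0.85,
}
model_scores = {
    ("NL", "divorce"): 0.60,
    ("NL", "political violence"): -0.70,
    ("EG", "divorce"): 0.10,
    ("EG", "political violence"): -0.75,
}

# Align the two score sets on their shared keys before correlating.
keys = sorted(survey_scores.keys() & model_scores.keys())
r = pearson_r([model_scores[k] for k in keys], [survey_scores[k] for k in keys])
print(f"toy alignment r = {r:.2f}")

# Arithmetic of the regional gap, using the averages reported in the abstract.
western_r, non_western_r = 0.82, 0.61
gap = round(western_r - non_western_r, 2)
print(f"regional gap = {gap}")
```

Correlating only over shared keys keeps the comparison restricted to country-topic pairs that both the survey and the model cover; how the paper handles missing pairs is not stated in this excerpt.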

Paper Structure

This paper contains 35 sections, 3 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: EvalMORAAL Framework Overview.
  • Figure 2: Geographic alignment by tier. Cells show tier-averaged Pearson $r$ from direct CoT scores; tiers are defined on WVS and reused for PEW.
  • Figure 3: Peer‑agreement vs. survey alignment. Each point is one model; the x‑axis is Pearson $r_{\text{DIR}}$ computed from direct CoT scores. Models are colored by performance tiers defined on WVS $r_{\text{DIR}}$. Within‑tier OLS lines with 95% CIs are shown for visualization; given small Top‑tier $n$, bands are descriptive.
  • Figure 4: Distribution of score differences with conflict threshold at 0.38, stratified by performance tier (Top, Mid, Lower). Lower‑tier pairs exhibit more mass above the threshold.
  • Figure 5: Mean absolute error by topic, aggregated within performance tiers. Violence‑related topics (e.g., political violence, terrorism) are consistently hardest; errors shrink as tier improves.
  • ...and 7 more figures
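
The Figure 4 caption mentions a conflict threshold of 0.38 on score differences. As a rough sketch of how such a threshold could be applied, the snippet below flags pairs of models whose scores for the same item differ by more than the threshold. The pairwise comparison, the strict `>` test, the `flag_conflicts` name, and the toy scores are assumptions made for illustration; this excerpt does not specify the exact rule used in the paper.

```python
from itertools import combinations

CONFLICT_THRESHOLD = 0.38  # value taken from the Figure 4 caption


def flag_conflicts(scores_by_model, threshold=CONFLICT_THRESHOLD):
    """Return (model_a, model_b, item, diff) for every pair of models whose
    scores on a shared item differ by more than `threshold`."""
    conflicts = []
    for a, b in combinations(sorted(scores_by_model), 2):
        shared_items = scores_by_model[a].keys() & scores_by_model[b].keys()
        for item in sorted(shared_items):
            diff = abs(scores_by_model[a][item] - scores_by_model[b][item])
            if diff > threshold:
                conflicts.append((a, b, item, round(diff, 2)))
    return conflicts


# Hypothetical toy scores keyed by (country, topic).
toy_scores = {
    "model_x": {("NL", "divorce"): 0.6, ("EG", "divorce"): -0.5},
    "model_y": {("NL", "divorce"): 0.5, ("EG", "divorce"): 0.1},
}

conflicts = flag_conflicts(toy_scores)
for model_a, model_b, item, diff in conflicts:
    print(f"{model_a} vs {model_b} on {item}: |diff| = {diff}")
```

In this toy setup only the ("EG", "divorce") pair exceeds the threshold; flagged pairs would then be candidates for the peer-review step described in the abstract.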