BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture

Shahriyar Zaman Ridoy, Azmine Toushik Wasi, Koushik Ahamed Tonmoy

TL;DR

BengaliMoralBench targets a key gap in AI ethics by evaluating Bengali moral reasoning within its cultural context, addressing the Western-centric bias of existing benchmarks. The paper introduces a 3,000-instance dataset spanning five life domains and three ethical lenses (Virtue, Commonsense, Justice), collected via native Bengali annotators and validated through multi-stage QC, with a unified zero-shot evaluation protocol across multiple multilingual LLMs. Key contributions include the triadic reasoning framework, rigorous data collection and statistics, and a comprehensive qualitative error analysis that identifies cultural grounding gaps and context sensitivity as core weaknesses in current models. The findings demonstrate that scale and architecture alone do not guarantee culturally aligned ethics, underscoring the need for culturally grounded pretraining, targeted prompting, and human-in-the-loop governance to enable responsible deployment in Bengali-speaking regions. The work provides a public, reproducible framework to localize AI ethics evaluation, facilitating more robust, culturally aware LLMs for diverse multilingual settings.

Abstract

As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally, remains underexplored. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts. It covers five moral domains, Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities, subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50-91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.

Paper Structure

This paper contains 59 sections, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Overview of the BengaliMoralBench Benchmark. (a) Illustrates examples from the Virtue, Justice, and Commonsense ethical frameworks, each presenting paired Bengali-English ethical and unethical behavioral scenarios grounded in cultural context. (b) Shows the domain-wise subtopic distribution structured across five major life domains: Family Relationships, Habits, Parenting, Religious Activities, and Daily Activities. Each domain consists of 10 culturally grounded subtopics. Each subtopic contains 20 instances (10 ethical + 10 unethical) per ethical framework, resulting in a total of 3,000 examples in the benchmark (5 domains × 10 subtopics × 20 instances × 3 frameworks).
  • Figure 2: Overview of the BengaliMoralBench pipeline. (a) Benchmark: Native annotators wrote culturally grounded moral scenarios, refined through a pilot phase and multi-stage validation. (b) Evaluation: LLMs classify behaviors as Ethical or Unethical based on the chosen ethics type.
  • Figure 3: Average LLM performance (Accuracy and F1) across tasks.
  • Figure 4: Average LLM performance (MCC and Kappa) across tasks.
  • Figure 5: Relation between model parameters and evaluation metrics.
  • ...and 5 more figures
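
Figures 3 and 4 report Accuracy, F1, MCC, and Kappa for the binary Ethical/Unethical classification task. As a rough illustration of how these four metrics relate to one confusion matrix, here is a minimal sketch in plain Python. The function name `binary_metrics`, the choice of "Ethical" as the positive class for F1, and the toy labels are assumptions made for illustration; they are not taken from the paper, which may use a different averaging scheme or library.

```python
from math import sqrt


def binary_metrics(gold, pred, positive="Ethical"):
    """Compute accuracy, F1, MCC, and Cohen's kappa for binary labels.

    Hypothetical helper for illustration only; the paper's exact metric
    configuration (e.g. macro vs. positive-class F1) is not specified here.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")

    # Confusion-matrix counts, treating `positive` as the positive class.
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n

    # F1 on the positive class (assumption: not macro-averaged).
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc, "kappa": kappa}


# Toy example with made-up labels (not benchmark data).
gold = ["Ethical", "Ethical", "Ethical", "Unethical", "Unethical", "Unethical"]
pred = ["Ethical", "Ethical", "Unethical", "Unethical", "Unethical", "Ethical"]
scores = binary_metrics(gold, pred)
print(scores)
```

On a class-balanced set such as the 10 ethical + 10 unethical instances per subtopic, a model that always answers "Ethical" would score 50% accuracy but an MCC and Kappa of 0, which is one reason chance-corrected metrics are useful alongside accuracy.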