Exploring Bengali Religious Dialect Biases in Large Language Models with Evaluation Perspectives

Azmine Toushik Wasi, Raima Islam, Mst Rafia Islam, Taki Hasan Rafi, Dong-Kyu Chae

TL;DR

The paper investigates Bengali religious dialect biases in large language models, focusing on Hindu and Muslim dialect signals. It constructs a prompt-based auditing framework and evaluates three popular LLMs (ChatGPT, Gemini, Copilot) across 20 sentences with religiously-toned content, examining prompt effects, memory, and contextual inference. Key findings reveal persistent Muslim-leaning bias, improvements when religion is explicitly mentioned in prompts but with leakage to other dialects, and better contextual inference from related text than explicit mentions, though inaccuracies remain. The authors discuss evaluation perspectives, propose multi-faceted mitigation strategies and human-in-the-loop approaches, and consider societal implications, while acknowledging limitations related to dataset size and model access.

Abstract

While Large Language Models (LLMs) have had a massive technological impact over the past decade, allowing for human-enabled applications, they can produce output that contains stereotypes and biases, especially in low-resource languages. This is of great ethical concern when dealing with sensitive topics such as religion. As a step toward making LLMs fairer, we explore bias from a religious perspective in Bengali, focusing on two main religious dialects: the Hindu-majority and Muslim-majority dialects. We perform experiments and audits that compare how three commonly used LLMs (ChatGPT, Gemini, and Microsoft Copilot) handle the Hindu and Muslim dialect variants of specific words across different sentences, showing which ones catch the social biases and which do not. Furthermore, we analyze our findings and relate them to potential causes and evaluation perspectives, considering their global impact given that Bengali has over 300 million speakers worldwide. With this work, we hope to establish rigor for improving fairness in LLMs, as these are widely used as creative writing agents.

Paper Structure (17 sections, 3 figures)

This paper contains 17 sections and 3 figures.

Figures (3)

  • Figure 1: Primary evaluation without any specifications
  • Figure 2: Experimental results on different settings
  • Figure 3: Experimental results on different contexts without any specification