
Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata

TL;DR

DebateBias-8K is a multilingual, debate-style benchmark for surfacing subtle narrative biases in open-ended LLM generation across seven languages and four high-impact domains. The dataset is built in three phases (seed creation, in-context expansion, and multilingual translation) and paired with an automated classification pipeline. Together they reveal stereotypes that persist even in safety-aligned models, with bias often amplified in low-resource languages. Key findings: Arabs are strongly associated with terrorism and religion, Western groups are consistently framed as modern, and Africans are linked to socioeconomic backwardness in several languages, with the resource level of a language modulating bias intensity. The work exposes a critical gap in multilingual fairness: English-centric alignment does not generalize globally, which motivates multilingual adversarial training and culturally grounded alignment. DebateBias-8K provides both the benchmark and an analysis framework to support safer, more inclusive model behavior across linguistic and cultural contexts.
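For a rough sense of scale, the totals reported in the abstract (8,400 prompts, seven languages, four domains) and the Figure 2 caption (50 seed prompts per domain) can be combined as below. This is only a back-of-the-envelope reading: it assumes, without confirmation from the summary, that prompts are split evenly across languages and domains and that expansion happens before translation, as the phase order suggests.

```latex
\begin{align*}
\frac{8{,}400 \text{ prompts}}{7 \text{ languages}} &= 1{,}200 \text{ prompts per language} \\
\frac{1{,}200 \text{ prompts}}{4 \text{ domains}} &= 300 \text{ prompts per domain per language} \\
\frac{300 \text{ prompts}}{50 \text{ seeds}} &= 6 \text{ prompts per seed, on average (seeds included)}
\end{align*}
```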

Abstract

Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains (women's rights, socioeconomic development, terrorism, and religion) across seven languages, ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (≥95%), Africans to socioeconomic "backwardness" (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment performed primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release the DebateBias-8K benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.
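Figures such as "≥95%" are shares of classified responses. A minimal sketch of that aggregation follows, assuming a hypothetical record format in which the classifier labels each response with the group it attaches the domain's stereotyped framing to. The `Classified` record, its field names, and the toy records are illustrative only; they are not the authors' classification pipeline or data.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple


class Classified(NamedTuple):
    """Hypothetical record for one model response after automatic classification."""
    model: str
    language: str
    domain: str
    target_group: str  # group the response attaches the stereotyped framing to


def association_rates(
    records: Iterable[Classified],
) -> dict[tuple[str, str], dict[str, float]]:
    """Share of responses per (language, domain) that target each group."""
    counts: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for r in records:
        counts[(r.language, r.domain)][r.target_group] += 1
    rates: dict[tuple[str, str], dict[str, float]] = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        rates[key] = {group: n / total for group, n in counter.items()}
    return rates


if __name__ == "__main__":
    # Toy records, not results from the paper.
    demo = [
        Classified("model-a", "sw", "terrorism", "Arab"),
        Classified("model-a", "sw", "terrorism", "Arab"),
        Classified("model-b", "sw", "terrorism", "Arab"),
        Classified("model-b", "sw", "terrorism", "Western"),
    ]
    print(association_rates(demo)[("sw", "terrorism")])  # {'Arab': 0.75, 'Western': 0.25}
```

The sketch pools all models to stay short; adding `model` to the grouping key would give per-model rates, which is the granularity needed to compare GPT-4o, Claude 3, DeepSeek, and LLaMA 3.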


Paper Structure

This paper contains 24 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Example of an open-ended debate-style prompt simulating expert debates across cultural backgrounds. This approach enables unconstrained generation across multiple languages to reveal subtle biases that might remain hidden in more constrained evaluation paradigms. See the section on prompt types for details.
  • Figure 2: DebateBias-8K construction pipeline. Semi-automatically created seed prompts (50 per domain) were expanded with model assistance, schema-validated, de-duplicated, translated into six additional languages, and checked through sampled back-translation audits (0.90 similarity threshold). See the dataset-generation section for details.
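The quality-control stages named in the Figure 2 caption (schema validation, de-duplication, sampled back-translation audit) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the schema fields and domain keys are hypothetical, `back_translate` is a caller-supplied function standing in for whatever translation system was used, and `difflib.SequenceMatcher` is a swapped-in surface similarity measure, since the caption gives the 0.90 threshold but not the metric.

```python
import difflib
import random
from typing import Callable

# Hypothetical schema and domain keys; the paper's actual schema is not given here.
REQUIRED_FIELDS = {"domain", "language", "prompt"}
DOMAINS = {"womens_rights", "socioeconomic", "terrorism", "religion"}
SIMILARITY_THRESHOLD = 0.90  # back-translation threshold stated in the Figure 2 caption


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def is_valid(item: dict) -> bool:
    """Schema check: required keys present, known domain, non-empty prompt text."""
    return (
        REQUIRED_FIELDS.issubset(item)
        and item["domain"] in DOMAINS
        and isinstance(item["prompt"], str)
        and item["prompt"].strip() != ""
    )


def deduplicate(items: list[dict]) -> list[dict]:
    """Drop prompts whose normalized text was already seen (keeps the first occurrence)."""
    seen: set[str] = set()
    unique: list[dict] = []
    for item in items:
        key = normalize(item["prompt"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def similarity(a: str, b: str) -> float:
    """Surface similarity in [0, 1]; a stand-in, since the caption does not name the metric."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def audit_sample(
    originals: list[str],
    translations: list[str],
    back_translate: Callable[[str], str],
    sample_size: int,
    seed: int = 0,
) -> list[int]:
    """Back-translate a random sample and return indices scoring below the threshold."""
    rng = random.Random(seed)
    sampled = rng.sample(range(len(originals)), k=min(sample_size, len(originals)))
    return sorted(
        i
        for i in sampled
        if similarity(originals[i], back_translate(translations[i])) < SIMILARITY_THRESHOLD
    )
```

In practice `back_translate` would wrap an available translation system, and flagged indices could be routed to manual review or re-translation. A character-level ratio is strict about paraphrase, so an embedding-based similarity may be a better fit if the goal is to check meaning rather than wording.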