Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs
Muhammed Saeed, Muhammad Abdul-Mageed, Shady Shehata
TL;DR
DebateBias-8K is a multilingual, debate-style benchmark for surfacing subtle narrative biases in open-ended LLM generation across seven languages and four high-impact domains. The dataset is built in three phases (seed creation, in-context expansion, multilingual translation) and paired with an automated classification pipeline, which reveals persistent stereotypes even in safety-aligned models. Key findings: Arabs are strongly associated with terrorism and religion, Western groups are consistently framed as modern, and Africans are linked to socioeconomic backwardness in several languages. Bias intensity varies with language-resource level and is often amplified in low-resource languages. The work highlights a critical gap in multilingual fairness: English-centric alignment does not generalize globally, which motivates multilingual adversarial training and culturally grounded alignment approaches. DebateBias-8K provides both the benchmark and an analysis framework to support safer, more inclusive model behavior across linguistic and cultural contexts.
Abstract
Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English-language, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains (women's rights, socioeconomic development, terrorism, and religion) across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (≥95%), Africans to socioeconomic "backwardness" (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, indicating that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release the DebateBias-8K benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.
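
The percentages reported above are shares of classified responses that assign a stereotyped role to a given group. As a minimal sketch of how such shares might be tallied per language and domain, assuming each classified response has already been reduced to a (language, domain, group) triple: the function name `group_shares`, the record layout, and the toy records are hypothetical illustrations, not the authors' pipeline or data.

```python
from collections import Counter, defaultdict


def group_shares(records):
    """Map each (language, domain) pair to {group: share of responses}.

    `records` is an iterable of (language, domain, group) triples, where
    `group` is the demographic label the classifier extracted from one
    model response. This layout is an assumption made for illustration.
    """
    counts = defaultdict(Counter)
    for language, domain, group in records:
        counts[(language, domain)][group] += 1

    shares = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        shares[key] = {group: n / total for group, n in counter.items()}
    return shares


# Toy records with placeholder group names (invented, not benchmark data).
records = [
    ("en", "terrorism", "group_a"),
    ("en", "terrorism", "group_a"),
    ("en", "terrorism", "group_b"),
    ("sw", "terrorism", "group_a"),
]
shares = group_shares(records)
```

Comparing the resulting shares for the same domain across languages is one simple way to check whether a group's share rises in lower-resource languages.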
