Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
TL;DR
This study presents the first cross-language bias audit of four public multilingual CLIP variants across ten languages and finds that multilinguality often amplifies gender and race stereotypes rather than mitigating them. Using template-based probes on FairFace and PATA within a unified embedding framework, the authors quantify bias with Max Skew and symmetric KL divergence. Shared encoders propagate English biases into gender-neutral languages, while adapter-based designs can curtail some biases but worsen others in low-resource settings. The debiased SigLIP-2 reduces agency and communion skews but leaves crime-associated stereotypes largely intact, especially under caption sparsity, exposing the center–periphery bias of English-centric alignment. The work highlights language-specific hot spots, the critical role of data scarcity and morphological gender marking, and the need for language-aware fairness objectives and reporting to ensure responsible deployment of multilingual vision–language systems.
Abstract
Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits the English anchor's crime associations and, in caption-sparse contexts (e.g., Xhosa), amplifies them. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.
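The TL;DR names Max Skew and symmetric KL divergence as the bias metrics, but this summary does not give their formulas. The sketch below is a minimal illustration under assumed definitions, not the authors' implementation: Max Skew is taken as the largest log-ratio between a group's observed share among retrieved images and its desired share, and symmetric KL as the sum of the two directed KL divergences. The function names, the `eps` clipping constant, and the toy labels are hypothetical.

```python
import math
from collections import Counter

EPS = 1e-12  # assumed clipping constant to avoid log(0); not from the paper


def max_skew(retrieved_labels, desired):
    """Largest ln(observed share / desired share) over the groups in `desired`.

    retrieved_labels: group label of each retrieved image (e.g. top-k for a prompt)
    desired: dict mapping group -> desired share (e.g. uniform on a balanced set)
    """
    counts = Counter(retrieved_labels)
    n = len(retrieved_labels)
    skews = {
        group: math.log(max(counts.get(group, 0) / n, EPS) / share)
        for group, share in desired.items()
    }
    return max(skews.values())


def kl(p, q):
    """Directed KL divergence KL(p || q) for two discrete distributions."""
    return sum(
        max(pi, EPS) * math.log(max(pi, EPS) / max(qi, EPS))
        for pi, qi in zip(p, q)
    )


def symmetric_kl(p, q):
    """Assumed symmetric form: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


# Toy example: 3 of 4 retrieved images are labeled "male" on a balanced set.
toy_labels = ["male", "male", "male", "female"]
uniform = {"male": 0.5, "female": 0.5}
toy_max_skew = max_skew(toy_labels, uniform)          # ln(0.75 / 0.5)
toy_sym_kl = symmetric_kl([0.75, 0.25], [0.5, 0.5])   # divergence from uniform
```

Under these assumed definitions, a perfectly balanced retrieval yields a Max Skew of 0 and a symmetric KL of 0 against the uniform target; larger values indicate that one group is over-represented.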
