Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

Zahraa Al Sahili, Ioannis Patras, Matthew Purver

TL;DR

This study provides the first cross-language bias audit of four public multilingual CLIP variants across ten languages, revealing that multilinguality often amplifies gender and race stereotypes rather than mitigating them. Using template-based probes on FairFace and PATA and a unified embedding framework, the authors quantify bias via Max Skew and symmetric KL divergence, showing that shared encoders propagate English biases into gender-neutral languages while adapter-based designs can curtail some biases but worsen others in low-resource settings. Debiasing efforts like SigLIP‑2 reduce agency and communion skews but leave crime-associated stereotypes largely intact, especially under caption sparsity, exposing the center–periphery bias of English-centric alignment. The work highlights language-specific hot spots, the critical role of data scarcity and morphological gender marking, and the need for language-aware fairness objectives and reporting to ensure responsible deployment of multilingual vision–language systems.

Abstract

Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits -- and in caption-sparse contexts (e.g., Xhosa) amplifies -- the English anchor's crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.

Paper Structure

This paper contains 57 sections, 13 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Gender max‑skew on FairFace. (a) Low‑resource languages (hi, xh, pt). (b) High‑resource languages (en, es, fr). Bars show crime, communion and agency skews for the four multilingual checkpoints. Spikes for CAPIVARA in Xhosa and SigLIP2 in Hindi reveal how data scarcity can inflate gender–crime associations even when corresponding English skews remain modest.
  • Figure 2: Race mean‑max‑skew on FairFace. (a) Low‑resource languages (hi, xh, pt). (b) High‑resource languages (en, es, fr). Mean‑max‑skew averages disparities over all race pairs; the tallest bars confirm that race–crime stereotypes intensify under the shared‑encoder (NLLB‑CLIP) in Hindi and explode for SigLIP2 in Xhosa.
  • Figure 3: Gender max‑skew on FairFace by grammatical system. (a) Gender‑neutral languages (tr, fa, fi). (b) Highly gendered languages (es, fr, sl). Replacing CLIP’s text tower with a shared multilingual encoder (NLLB‑CLIP) leaves gender‑neutral skews small, whereas adapter‑based CAPIVARA and Web‑scale SigLIP2 show sharp increases once overt grammatical gender is present.
  • Figure 4: Race mean‑max‑skew on FairFace by grammatical system. (a) Gender‑neutral languages (tr, fa, fi). (b) Highly gendered languages (es, fr, sl). Race skews rise most for the loosely coupled CAPIVARA adapters in gender‑neutral Turkish ($3.25$) and for SigLIP2 in gendered French ($5.58$), underscoring that grammatical gender can interact with race biases in non‑obvious ways.
  • Figure 5: Symmetric KL divergence for gender on FairFace by morphological class. (a) Left: gender‑neutral languages. (b) Right: highly gendered languages. KL underscores architecture‑specific risks: SigLIP2 remains nearly unbiased in neutral tongues, whereas CAPIVARA exhibits extreme communion divergence in French ($\mathrm{SKL}_{\text{comm}}\!>\!0.5$), confirming that grammatical gender can amplify underlying stereotypes.
  • ...and 5 more figures
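
The captions above report three quantities: max‑skew, mean‑max‑skew over race pairs, and symmetric KL divergence. The exact formulas are not reproduced in this summary, so the Python sketch below is only an illustration under stated assumptions. Skew is taken as the log‑ratio of a group's observed share to its desired share (the usual ranking‑fairness definition), mean‑max‑skew is taken as the average of pairwise max‑skews against a 50/50 target, and symmetric KL is taken as the unweighted sum of the two directed divergences (some authors halve it). All function names and the `eps` smoothing constant are hypothetical and may differ from what the paper uses.

```python
import math
from itertools import combinations


def skew(observed: float, desired: float, eps: float = 1e-12) -> float:
    """Log-ratio of a group's observed share to its desired share (assumed definition)."""
    return math.log((observed + eps) / (desired + eps))


def max_skew(observed: dict, desired: dict) -> float:
    """Largest skew over all groups; 0 means every group matches its desired share."""
    return max(skew(observed[g], desired[g]) for g in desired)


def mean_max_skew(counts: dict) -> float:
    """Average of pairwise max-skews, each pair compared against a 50/50 target.

    Assumption: this is one plausible reading of "averages disparities over
    all race pairs"; the paper's exact aggregation may differ.
    """
    values = []
    for a, b in combinations(sorted(counts), 2):
        total = counts[a] + counts[b]
        if total == 0:
            continue  # neither group was retrieved, so the pair carries no signal
        observed = {a: counts[a] / total, b: counts[b] / total}
        values.append(max_skew(observed, {a: 0.5, b: 0.5}))
    return sum(values) / len(values) if values else 0.0


def kl(p, q, eps: float = 1e-12) -> float:
    """Directed KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q) -> float:
    """KL(p || q) + KL(q || p); unweighted sum (assumed convention)."""
    return kl(p, q) + kl(q, p)


if __name__ == "__main__":
    # Hypothetical retrieval result: 75% of top matches for a prompt are male faces.
    print(max_skew({"male": 0.75, "female": 0.25}, {"male": 0.5, "female": 0.5}))
    # Hypothetical per-race counts for one stereotype prompt.
    print(mean_max_skew({"a": 30, "b": 10, "c": 10}))
    # Divergence between an observed label distribution and a uniform reference.
    print(symmetric_kl([0.75, 0.25], [0.5, 0.5]))
```

Under these assumptions, a value of 0 for any of the three metrics means the retrieved groups exactly match the reference distribution, and larger values mean stronger disparity. Because max‑skew keeps only the most over‑represented group while symmetric KL compares whole distributions, the two can rank models differently.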