Cultural Awareness in Vision-Language Models: A Cross-Country Exploration

Avinash Madasu, Vasudev Lal, Phillip Howard

TL;DR

This work introduces a cross-country evaluation framework for Vision-Language Models (VLMs) to quantify cultural awareness and biases across six racial groups using FairFace and SocialCounterfactuals. It defines a Retrieval across Countries task with the metric $R@C$ and evaluates three probes—Race to Country, Personal Traits to Country, and Physical Characteristics to Country—across four VLMs (ALIP, LACLIP, OpenCLIP, BLIP-2). Results reveal persistent racial-regional biases and model-dependent stereotype mappings that differ by dataset and task, with some models showing severe overgeneralization. The findings underscore the need for culturally aware training and debiasing to improve the global applicability and fairness of VLMs. The work provides a formalized, scalable approach to assess multicultural competence in visual-text embeddings and highlights practical implications for deploying VLMs in diverse cultural contexts.

Abstract

Vision-Language Models (VLMs) are increasingly deployed in diverse cultural contexts, yet their internal biases remain poorly understood. In this work, we propose a novel framework to systematically evaluate how VLMs encode cultural differences and biases related to race, gender, and physical traits across countries. We introduce three retrieval-based tasks: (1) Race to Country retrieval, which examines the association between individuals from specific racial groups (East Asian, White, Middle Eastern, Latino, South Asian, and Black) and different countries; (2) Personal Traits to Country retrieval, where images are paired with trait-based prompts (e.g., Smart, Honest, Criminal, Violent) to investigate potential stereotypical associations; and (3) Physical Characteristics to Country retrieval, focusing on visual attributes like skinny, young, obese, and old to explore how physical appearances are culturally linked to nations. Our findings reveal persistent biases in VLMs, highlighting how visual representations may inadvertently reinforce societal stereotypes.
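The retrieval setup described above can be illustrated with a minimal sketch. This excerpt does not give the formula for $R@C$, so the code below assumes one plausible reading: for each image, the country prompts are ranked by cosine similarity to the image embedding, and the score for a country is the fraction of images whose top-ranked country it is. The function names, the toy two-dimensional embeddings, and the placeholder country labels are all hypothetical; in practice the embeddings would come from a VLM's image and text encoders.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_country(image_emb, country_embs):
    """Return the country whose prompt embedding is closest to the image."""
    return max(country_embs, key=lambda c: cosine(image_emb, country_embs[c]))


def retrieval_share(image_embs, country_embs):
    """Assumed metric: fraction of images whose top-1 country is each country."""
    counts = Counter(top_country(e, country_embs) for e in image_embs)
    return {c: counts.get(c, 0) / len(image_embs) for c in country_embs}


# Toy embeddings (hypothetical values, not produced by any model).
country_embs = {"Country A": [1.0, 0.0], "Country B": [0.0, 1.0]}
image_embs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.6, 0.5]]

shares = retrieval_share(image_embs, country_embs)
print(shares)  # {'Country A': 0.75, 'Country B': 0.25}
```

Under this assumed reading, a strongly skewed distribution for images of one racial group would indicate that the model ties that group to a small set of countries.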

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Performance comparison of different VLMs across races, with retrieved countries shown as a percentage of total images, on the FairFace dataset.
  • Figure 2: Performance comparison of different VLMs across races, with retrieved countries shown as a percentage of total images, on the SocialCounterfactuals dataset.
  • Figure 3: Top-5 countries retrieved by ALIP, LACLIP, and OpenCLIP for the "skinny" physical characteristic.
  • Figure 4: Top-5 countries retrieved by ALIP, LACLIP, and OpenCLIP for the "young" physical characteristic.
  • Figure 5: Top-5 countries retrieved by ALIP, LACLIP, and OpenCLIP for the "obese" physical characteristic.
  • ...and 2 more figures