See It from My Perspective: How Language Affects Cultural Bias in Image Understanding

Amith Ananthram, Elias Stengel-Eskin, Mohit Bansal, Kathleen McKeown

TL;DR

This work systematically analyzes Western bias in vision-language models (VLMs) and isolates language as a contributing factor. It introduces a two-step framework. Step 1 characterizes bias in off-the-shelf VLMs by comparing their performance on the Western and East Asian splits of each task. Step 2 traces the sources of that bias by training comparable multilingual variants (mLLaVA) that vary the language mix of the base LLM's pre-training corpus, the prompting language, and the language mix of the multimodal fusion corpus. The study finds that Western bias is widespread, but that substantial reductions are achievable when the base LLM's pre-training includes more Chinese and when the prompting language is closer to the target culture; multimodal fusion data alone, however, cannot substitute for rich multilingual pre-training. These findings highlight the importance of multilingual foundation models and language-diverse pre-training for building fairer VLMs and mitigating hegemonic biases in AI systems.

Abstract

Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from East Asian cultures attend more to scene context. In this work, we characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity. We evaluate VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western split than on the East Asian split of each task. Through controlled experimentation, we trace one source of this bias in image understanding to the lack of diversity in language model construction. While inference in a language nearer to a culture can lead to reductions in bias, we show it is much more effective when that language was well-represented during text-only pre-training. Interestingly, this yields bias reductions even when prompting in English. Our work highlights the importance of richer representation of all languages in building equitable VLMs.

Paper Structure (47 sections, 2 equations, 6 figures, 6 tables)

This paper contains 47 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Whose perspective do VLMs model? Despite being multilingual, state-of-the-art VLMs exhibit a bias toward imagery and perspectives from Western culture. A more balanced language mix during text-only pre-training produces VLMs that are both multilingual and multicultural.
  • Figure 2: Our approach. Step 1: We measure the Western bias of off-the-shelf ($\text{OTS}_{i}$) VLMs on culturally diverse image understanding tasks by comparing their performance on each task's Western and East Asian splits. Step 2: We train comparable multilingual VLMs (mLLaVA). We explore three model design choices focused on language: (A) the language mix in the pre-training corpus of the base LLM; (B) the prompting language; and (C) the language mix in the multimodal fusion corpus. We test each mLLaVA variant, measuring the effects of (A), (B), and (C) on Western bias.
  • Figure 3: Examples from our culturally diverse image understanding tasks which range from the objective to the subjective: object identification, question answering and art emotion classification.
  • Figure 4: The bias (Western performance divided by East Asian performance) of each of our $\text{OTS}_{i}$ VLMs when prompted in English and in Chinese. Markers that fall in the gold / purple regions indicate Western and East Asian biases respectively. While Western bias reductions are seen across all tasks when prompting in Chinese, they are not seen consistently (in only 15/30 cases).
  • Figure 5: The change in bias (Western performance divided by East Asian performance) of each of our mLLaVA variants. Markers that fall in the gold region indicate a Western bias; in the purple region, an East Asian bias. Unbroken lines indicate bias reductions that are significant at the $\alpha = 0.05$ level.
  • ...and 1 more figures
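
The captions of Figures 4 and 5 define bias as Western performance divided by East Asian performance. The minimal sketch below shows that ratio in Python. The function names, the tolerance, and the example scores are hypothetical and are not taken from the paper. Reading a ratio above 1 as a Western bias and a ratio below 1 as an East Asian bias is an assumption inferred from the caption wording.

```python
def bias_ratio(western_score: float, east_asian_score: float) -> float:
    """Return Western performance divided by East Asian performance.

    Both scores are assumed to come from the same task metric
    (for example, accuracy on each split of one task).
    """
    if east_asian_score <= 0:
        raise ValueError("East Asian score must be positive to form a ratio")
    return western_score / east_asian_score


def bias_direction(ratio: float, tol: float = 1e-9) -> str:
    """Label the direction of bias implied by a ratio (assumed convention)."""
    if ratio > 1 + tol:
        return "Western"
    if ratio < 1 - tol:
        return "East Asian"
    return "none"


# Hypothetical scores for illustration only: (Western split, East Asian split).
example_scores = {
    "prompt_en": (0.75, 0.5),
    "prompt_zh": (0.5, 0.5),
}

for setting, (west, east) in example_scores.items():
    ratio = bias_ratio(west, east)
    print(f"{setting}: ratio={ratio:.2f}, bias={bias_direction(ratio)}")
```

Under this convention, a drop in the ratio toward 1 between two settings would correspond to the kind of bias reduction the captions describe. Whether a given reduction is statistically significant needs a separate test that this sketch does not cover.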