See It from My Perspective: How Language Affects Cultural Bias in Image Understanding
Amith Ananthram, Elias Stengel-Eskin, Mohit Bansal, Kathleen McKeown
TL;DR
This work systematically analyzes Western bias in vision-language models (VLMs) and isolates language as one contributing factor. It follows a two-step framework. Step 1 characterizes bias in off-the-shelf VLMs by comparing performance on the Western and East Asian splits of each task. Step 2 traces the source of that bias by training multilingual variants (mLLaVA) that vary the base LLM's pre-training language mix, the prompting language, and the language mix used in fusion. The study finds that Western bias is widespread, and that it shrinks substantially when the base LLM's pre-training data includes more Chinese and when the prompting language is culturally closer to the images being evaluated. Fusion data alone, however, cannot substitute for rich multilingual pre-training. These findings highlight the importance of multilingual foundation models and language-diverse pre-training for building fairer VLMs and mitigating hegemonic biases in AI systems.
Abstract
Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from East Asian cultures attend more to scene context. In this work, we characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity. We evaluate VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western split than on the East Asian split of each task. Through controlled experimentation, we trace one source of this bias in image understanding to the lack of diversity in language model construction. While inference in a language nearer to a culture can lead to reductions in bias, we show it is much more effective when that language was well-represented during text-only pre-training. Interestingly, this yields bias reductions even when prompting in English. Our work highlights the importance of richer representation of all languages in building equitable VLMs.
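To make the Western versus East Asian comparison concrete, the following minimal sketch computes per-split accuracy and two simple disparity summaries (a difference and a ratio). It is an illustration only: the function names, the split labels, the toy records, and the choice of summaries are assumptions made for this example and are not stated in the text above.

```python
from collections import defaultdict


def split_accuracy(records):
    """Return accuracy per split from (split, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for split, is_correct in records:
        totals[split] += 1
        correct[split] += int(is_correct)
    return {split: correct[split] / totals[split] for split in totals}


def western_bias(acc, western="western", east_asian="east_asian"):
    """Summarize disparity between two splits (hypothetical summaries).

    gap   > 0  means the model scores higher on the Western split.
    ratio > 1  means the same thing on a multiplicative scale.
    """
    gap = acc[western] - acc[east_asian]
    if acc[east_asian] > 0:
        ratio = acc[western] / acc[east_asian]
    else:
        ratio = float("inf")
    return {"gap": gap, "ratio": ratio}


# Toy, made-up predictions: 3/4 correct on Western, 2/4 on East Asian.
records = [
    ("western", True), ("western", True), ("western", True), ("western", False),
    ("east_asian", True), ("east_asian", True),
    ("east_asian", False), ("east_asian", False),
]

acc = split_accuracy(records)
bias = western_bias(acc)
print(acc)   # per-split accuracy
print(bias)  # disparity summaries
```

Running the same computation for each model variant (for example, different base LLMs or prompting languages) would let the disparity values be compared across conditions, which is the kind of controlled comparison the abstract describes.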
