Beyond Words: Exploring Cultural Value Sensitivity in Multimodal Models
Srishti Yadav, Zhi Zhang, Daniel Hershcovich, Ekaterina Shutova
TL;DR
The paper addresses how cultural values are embedded and probed in multimodal vision-language models (VLMs) by using images as cultural proxies alongside World Values Survey questions. It introduces a workflow that prompts models with country cues and culture-specific images, and evaluates their value-alignment across 15 WVS topics and 10 image categories using MCQ-style prompts and Jensen-Shannon similarity to human responses. Across model sizes (13B, 34B, 72B), results reveal that images can improve cultural alignment for certain topics and contexts, but effects are highly variable by topic, country, and income level, with larger models not universally outperforming smaller ones. The work underscores the importance of multimodal cultural evaluation, reveals non-monotonic scale effects, and argues for diverse, culturally representative data and careful deployment to mitigate bias in global contexts.
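To make the evaluation metric concrete, the sketch below shows one way a model's multiple-choice answers could be compared against human survey responses with a Jensen-Shannon similarity. It is an illustration, not the authors' implementation. The helper names (`answer_distribution`, `js_similarity`) are hypothetical. Defining similarity as 1 minus the base-2 Jensen-Shannon divergence is an assumption, since the summary does not state the exact formula.

```python
import math
from collections import Counter


def answer_distribution(answers, options):
    """Turn a list of chosen options into a probability vector over `options`."""
    counts = Counter(answers)
    total = sum(counts[o] for o in options)
    if total == 0:
        raise ValueError("no answers fall within the given options")
    return [counts[o] / total for o in options]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_similarity(p, q):
    """Assumed similarity score: 1 - JS divergence (1 = identical, 0 = disjoint)."""
    return 1.0 - js_divergence(p, q)


if __name__ == "__main__":
    # Toy, made-up answers for a single question with four options.
    options = ["A", "B", "C", "D"]
    human = answer_distribution(["A", "A", "B", "C"], options)
    model = answer_distribution(["A", "B", "B", "B"], options)
    print(f"JS similarity: {js_similarity(human, model):.3f}")
```

Under this assumed definition, a score near 1 means the model's answer distribution for a question closely matches the human distribution for that country, and a score near 0 means the two barely overlap.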
Abstract
Investigating value alignment in Large Language Models (LLMs) based on cultural context has become a critical area of research. However, similar biases have not been extensively explored in large vision-language models (VLMs). As the scale of multimodal models continues to grow, it becomes increasingly important to assess whether images can serve as reliable proxies for culture and how these values are embedded through the integration of both visual and textual data. In this paper, we conduct a thorough evaluation of multimodal models at different scales, focusing on their alignment with cultural values. Our findings reveal that, much like LLMs, VLMs exhibit sensitivity to cultural values, but their performance in aligning with these values is highly context-dependent. While VLMs show potential in improving value understanding through the use of images, this alignment varies significantly across contexts, highlighting the complexities and underexplored challenges in the alignment of multimodal models.
