CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries

Shudong Liu; Yiqiao Jin; Cheng Li; Derek F. Wong; Qingsong Wen; Lichao Sun; Haipeng Chen; Xing Xie; Jindong Wang

TL;DR

The paper addresses the lack of cultural awareness in vision-language models by introducing CultureVerse, a large-scale multimodal benchmark spanning 188 countries with 19,682 tangible cultural concepts and 3 question types. It provides a scalable data collection and QA pipeline, which enables CultureVLMs: models fine-tuned on this data that improve multicultural understanding while preserving general VLM performance. Key findings are a persistent Western bias, substantial gains from fine-tuning, and strong cross-cultural generalization. Stated limitations include using language as a proxy for culture and evaluating only with multiple-choice questions. The work offers a foundation for more equitable, globally representative multimodal AI systems, with insights into data diversity, model tuning, and cultural forgetting dynamics.

Abstract

Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. We then propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with stronger performance on Western concepts and weaker results in African and Asian contexts. Fine-tuning on CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work can lay the foundation for more equitable and culturally aware multimodal AI systems.
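
The regional disparities described above come from comparing multiple-choice accuracy across groups of questions. The following is a minimal sketch of that kind of breakdown, not the authors' evaluation code. The record fields (`region`, `prediction`, `answer`) and the toy data are hypothetical and do not reflect the actual CultureVerse schema or any reported result.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key="region"):
    """Return {group: accuracy} for a list of multiple-choice records.

    Each record is a dict holding the group label, the model's chosen
    option letter, and the gold option letter (hypothetical field names).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records for illustration only; not drawn from the benchmark.
records = [
    {"region": "Europe", "prediction": "A", "answer": "A"},
    {"region": "Europe", "prediction": "b", "answer": "B"},
    {"region": "Africa", "prediction": "C", "answer": "A"},
    {"region": "Africa", "prediction": "D", "answer": "D"},
]
scores = accuracy_by_group(records)
```

Passing a different `group_key` (for example a question-type or concept-category field, if the records carry one) gives the same breakdown along another axis.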

Paper Structure (26 sections, 9 figures, 10 tables)

This paper contains 26 sections, 9 figures, 10 tables.

Introduction
Related Work
CultureVerse: A Scalable Benchmark for VLM Cultural Understanding
Tangible Cultural Concept Collection
Question-Answer Generation
Quality Assurance
Scalability
Analysis of CultureVerse
Experiments with CultureVerse
Experimental Setup
Main Results
Training CultureVLM
Generalization and Robustness
Catastrophic Forgetting
Case Study
...and 11 more sections

Figures (9)

Figure 1: Our pipeline to build CultureVerse and CultureVLM.
Figure 2: Overview of CultureVerse. In total, there are over 220k instances and 19k cultural concepts for training and evaluation, respectively, composed of 3 different types of questions from 188 countries.
Figure 3: Accuracy of different models on three tasks (upper), five regions (middle), and three categories of concepts (lower).
Figure 4: Results and analysis of CultureVLM by fine-tuning on our CultureVerse.
Figure 5: Generalization and Robustness. Left: Performance of CultureVLM (y-axis) evaluated across data from different continents (x-axis). CultureVLM achieves the highest performance for in-distribution settings, while still demonstrating strong generalizability for out-of-domain settings. Right: CultureVLM fine-tuned with data under different categories of CultureVerse (x-axis) and evaluated across various categories (y-axis). CHT denotes Cultural Heritage and Traditions; HL denotes History and Landmarks; and NELR denotes Natural Environment and Local Resources.
...and 4 more figures
