Benchmarking Vision Language Models for Cultural Understanding

Shravan Nayak; Kanishk Jain; Rabiul Awal; Siva Reddy; Sjoerd van Steenkiste; Lisa Anne Hendricks; Karolina Stańczak; Aishwarya Agrawal

TL;DR

This paper introduces CulturalVQA, a 2,378-question, 2,328-image benchmark designed to test geo-diverse cultural understanding in vision-language models across 11 countries and five cultural facets. It combines culturally informed image selection from the CANDLE dataset with annotator-generated questions and concise, culturally precise answers. Model responses are scored with the reference-based LAVE metric, using GPT-4 as judge. Evaluation across closed- and open-source models reveals strong regional disparities, with underrepresented concepts from Africa proving more challenging than North American ones, and a clear gap between proprietary and open models. The work demonstrates both the current limits of cultural comprehension in VLMs and the utility of CulturalVQA as a diagnostic tool to drive progress, supported by facet- and language-related analyses and qualitative failure analyses.

Abstract

Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has typically been assessed on general scene understanding (recognizing objects, attributes, and actions) rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparities in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America but significantly lower performance for Africa. We also observe disparities in their performance across cultural facets, with clothing, rituals, and traditions seeing higher performance than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.
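The regional and facet-level comparisons described in the abstract amount to grouping per-question scores and averaging within each group. The following is a minimal sketch of that aggregation, not the authors' evaluation code. It assumes each model response has already been assigned a score in [0, 1] by some judge; the record layout, the field names (`country`, `facet`, `score`), and the helper `accuracy_by_group` are hypothetical, and the toy records are made up rather than taken from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return the mean score per group.

    `records` is an iterable of dicts, each with a numeric `score`
    in [0, 1] and a grouping field named by `key` (hypothetical layout).
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [score sum, count]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += rec["score"]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Made-up toy records for illustration only; not results from the paper.
records = [
    {"country": "A", "facet": "food", "score": 1.0},
    {"country": "A", "facet": "clothing", "score": 1.0},
    {"country": "B", "facet": "food", "score": 0.0},
    {"country": "B", "facet": "clothing", "score": 1.0},
]

by_country = accuracy_by_group(records, "country")
by_facet = accuracy_by_group(records, "facet")
```

Using the same helper for both breakdowns keeps the country-level and facet-level numbers directly comparable, since they are computed from one set of per-question scores.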

Paper Structure (54 sections, 15 figures, 7 tables)

Introduction
Related work
Cultural Taxonomy
Selection of Countries
Selection of Images
Question Collection
Answer Collection
Dataset Analysis
Images
Questions
Answers
Cultural Concepts
Evaluating VLMs on CulturalVQA
Evaluation Metric
VLMs used for evaluation
...and 39 more sections

Figures (15)

Figure 1: The performance of VLMs over time, segmented by non-Western (red) and Western (blue) countries, with model release dates annotated (bottom). Dashed and solid lines differentiate trends for non-Western and Western countries respectively. VLMs' understanding of non-Western cultures has been in a steep upward trend since Jan '24.
Figure 2: Samples from CulturalVQA. Our dataset comprises images presenting cultural concepts from 11 countries across five facets: traditions, rituals, food, drink, and clothing. It further includes questions probing cultural understanding of the concepts presented in the images and answers to these questions.
Figure 3: Comparative analysis of data by country. The figure presents three aspects: (A) unique counts of images, questions, and answers, (B) average lengths of questions and answers, and (C) average number of answers per question and inter-annotator agreement scores across countries, showcasing variations and trends in CulturalVQA.
Figure 4: Word clouds representing the answers in CulturalVQA across five facets of culture: clothing, drink, food, rituals, and traditions. In the bottom right, a breakdown of cultural facets in data is depicted.
Figure 5: Baseline evaluation of the degree of visual understanding required in CulturalVQA: LLM-only, LLM with a country-specific context, LLM with Google Lens entities, and GPT-4V.
...and 10 more figures
