BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models

Bryan Chen Zhengyu Tan, Zheng Weihua, Zhengyuan Liu, Nancy F. Chen, Hwaran Lee, Kenny Tsu Wei Choo, Roy Ka-Wei Lee

TL;DR

BLEnD-Vis provides a rigorous, multimodal benchmark to probe robust, transferable everyday cultural knowledge in vision-language models across 16 regions. By aligning three formats, namely Original Region $\rightarrow$ Entity, Rephrased Entity $\rightarrow$ Region, and VQA-style (Image + Placeholder $\rightarrow$ Region), the dataset enables controlled diagnostics of linguistic and visual grounding. Evaluations of 13 VLMs reveal brittleness to rephrasing, beneficial but not fully reliable visual cues, and substantial cross-modal consistency gaps, with model scale not reliably predicting success. Cross-modal fine-tuning shows textual grounding strongly enhances visual transfer, underscoring the foundational role of language-based representations in multimodal cultural understanding. The work highlights significant regional disparities and provides a foundation for developing more culturally aware, robust VLMs and targeted future benchmarks.

Abstract

As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\to$ Entity, (ii) an inverted text-only variant (Entity $\to$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.
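To make the three aligned formats concrete, the following is a minimal sketch of how one question template might expand into an Original, a Rephrased, and a VQA-style item. It is an illustration under stated assumptions, not the authors' pipeline: the `MCQ` class, the `build_aligned_items` helper, every field name, the number of options, and the example question text are all hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MCQ:
    """One multiple-choice instance. All field names are hypothetical."""
    template_id: str
    fmt: str                      # "original", "rephrased", or "vqa"
    question: str
    options: Tuple[str, ...]
    answer: str
    image_path: Optional[str] = None


def build_aligned_items(template_id, region, entity, other_regions,
                        other_entities, original_q, rephrased_q,
                        placeholder, image_path):
    """Expand one template into three aligned items (illustrative only)."""
    entity_options = tuple(sorted([entity, *other_entities]))
    region_options = tuple(sorted([region, *other_regions]))
    # (i) Region -> Entity: the question names the region, answers are entities.
    original = MCQ(template_id, "original",
                   original_q.format(region=region), entity_options, entity)
    # (ii) Entity -> Region: the question names the entity, answers are regions.
    rephrased = MCQ(template_id, "rephrased",
                    rephrased_q.format(entity=entity), region_options, region)
    # (iii) VQA-style: same as (ii), but the entity name is swapped for a
    # placeholder phrase and the entity is shown in an image instead.
    vqa = MCQ(template_id, "vqa",
              rephrased_q.format(entity=placeholder), region_options, region,
              image_path)
    return original, rephrased, vqa


# Invented example values; none of these strings come from BLEnD-Vis.
original, rephrased, vqa = build_aligned_items(
    template_id="demo-001",
    region="Region A",
    entity="dish X",
    other_regions=["Region B", "Region C", "Region D"],
    other_entities=["dish Y", "dish Z", "dish W"],
    original_q="What is a common breakfast food in {region}?",
    rephrased_q="In which region is {entity} a common breakfast food?",
    placeholder="the food shown in the image",
    image_path="images/demo-001.png",
)
```

Because the Rephrased and VQA-style items share the same answer options and gold label in this sketch, any difference in a model's response between them can be attributed to the modality in which the entity is presented.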

Paper Structure

This paper contains 40 sections, 11 figures, 18 tables.

Figures (11)

  • Figure 1: Overview of the BLEnD-Vis benchmark construction and evaluation framework. The process involves: (1) Data Construction via tangibility filtering, question rephrasing, and image generation based on BLEnD; (2) Human Validation of generated assets; (3) Creation of the final BLEnD-Vis Dataset comprising three parallel MCQ formats (Original Text, Rephrased Text, VQA) across 16 regions; and (4) Evaluation assessing VLM zero-shot accuracy, robustness to rephrasing, cross-modal consistency, and regional performance variations.
  • Figure 2: Accuracies (%) of each evaluated VLM for the VQA-Style MCQ format in BLEnD-Vis (Full Dataset) across 16 different cultural regions (see the region code table in the appendix for region code definitions), highlighting regional variations in model performance. (Original and Rephrased text-only formats are in the appendix.)
  • Figure 3: Original Text-Only MCQ Performance (%) by Region and Model on BLEnD-Vis (Full Dataset).
  • Figure 4: Rephrased Text-Only MCQ Performance (%) by Region and Model on BLEnD-Vis (Full Dataset).
  • Figure 5: ID: Ca-sp-45, Topic: Family, Region: Iran, Target Answer: 'north'. Reason Flagged 'BAD': The image of a family in a forest is too generic and lacks specific visual cues to represent a destination in the 'north' of Iran, failing to convey the intended concept. Failure mode: Ambiguous/Unclear Representation.
  • ...and 6 more figures
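The evaluation step named in the Figure 1 caption (zero-shot accuracy, robustness to rephrasing, cross-modal consistency) could be computed roughly as sketched below. The exact metric definitions are not given in this excerpt, so two assumptions are made here: robustness is measured as the accuracy drop from the Original to the Rephrased format, and cross-modal consistency is measured as the fraction of aligned items on which the model picks the same option in the Rephrased and VQA-style formats. The function names and the toy predictions are hypothetical.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def agreement(preds_a, preds_b):
    """Fraction of aligned items on which two formats get the same prediction.

    Assumed stand-in for cross-modal consistency; the paper may define it
    differently.
    """
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy data for four aligned items (invented, not benchmark results).
gold_entities = ["w", "x", "y", "z"]   # gold answers, Original format
gold_regions = ["A", "B", "C", "D"]    # gold answers, Rephrased and VQA formats

preds_original = ["w", "x", "y", "z"]
preds_rephrased = ["A", "B", "D", "D"]
preds_vqa = ["A", "C", "D", "A"]

acc_original = accuracy(preds_original, gold_entities)
acc_rephrased = accuracy(preds_rephrased, gold_regions)
acc_vqa = accuracy(preds_vqa, gold_regions)

# Assumed robustness measure: accuracy lost when the question is inverted.
rephrasing_drop = acc_original - acc_rephrased

# Assumed consistency measure: same chosen option across text and image.
cross_modal_consistency = agreement(preds_rephrased, preds_vqa)
```

Agreement is counted regardless of correctness, so a model that is consistently wrong across modalities still scores as consistent; a stricter variant would count only items answered correctly in both formats.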