Seeing Symbols, Missing Cultures: Probing Vision-Language Models' Reasoning on Fire Imagery and Cultural Meaning

Haorui Yu, Yang Zhao, Yijia Chu, Qiufeng Yi

TL;DR

The paper asks whether Vision-Language Models truly understand cultural semantics or merely rely on symbolic associations when interpreting fire-themed imagery. It introduces a diagnostic framework that combines a controlled, culturally diverse dataset with zero-shot classification and explanation analysis to reveal reasoning patterns that accuracy alone does not capture. Findings show a systematic reliance on symbolic shortcuts, Western-centric bias, and safety-critical misclassifications in emergency contexts, pointing to data bias and fairness issues. The work advocates a shift toward interpretability and culture-aware evaluation to build culturally robust multimodal systems that reason more safely and accurately in real-world settings.

Abstract

Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.

Paper Structure

This paper contains 17 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Confusion matrices for GPT-4o (left) and Qwen 2.5-VL 7B (right). Rows represent the true cultural labels, and columns represent the predicted labels. These matrices reveal the specific patterns of misclassification for the highest- and lowest-performing models.
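
To make the row/column convention in Figure 1 concrete, the following is a minimal sketch of how such a confusion matrix could be tallied from zero-shot predictions. The three class names and the label lists are hypothetical placeholders invented for illustration; they are not the paper's label set or results.

```python
# Minimal sketch: tally a confusion matrix with rows = true labels and
# columns = predicted labels, matching the convention in Figure 1.
# CLASSES, y_true, and y_pred are hypothetical, invented for illustration.

def confusion_matrix(true_labels, predicted_labels, classes):
    """Return a nested list; entry [i][j] counts items of true class i predicted as class j."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, pred in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[pred]] += 1
    return matrix


CLASSES = ["western_festival", "non_western_tradition", "emergency"]

y_true = [
    "western_festival", "western_festival",
    "non_western_tradition", "non_western_tradition",
    "emergency", "emergency",
]
y_pred = [
    "western_festival", "western_festival",
    "western_festival", "non_western_tradition",
    "western_festival", "emergency",
]

cm = confusion_matrix(y_true, y_pred, CLASSES)

# Off-diagonal entries in the "emergency" row are the safety-critical errors:
# emergency scenes that were labelled as some kind of celebration.
emergency_row = cm[CLASSES.index("emergency")]
emergency_as_celebration = sum(
    count for name, count in zip(CLASSES, emergency_row) if name != "emergency"
)

for name, row in zip(CLASSES, cm):
    print(f"{name:>22}: {row}")
print("emergencies read as celebrations:", emergency_as_celebration)
```

Reading along a row shows where a given true class ends up, which is why the emergency row is the one to inspect for safety-critical misclassifications.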