Vision Language Models are Confused Tourists

Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji

TL;DR

This work tackles the problem of cultural robustness in vision-language understanding by introducing ConfusedTourist, a robustness suite with 5,451 images across 243 cultural items from 57 countries. It combines context crawling, adversarial pairing, and two perturbation strategies (image stacking and generative perturbations) to systematically evaluate grounding under conflicting cultural cues. Across 14 state-of-the-art systems, the study reveals substantial accuracy drops, most pronounced with generative perturbations and flag cues, and shows that models increasingly rely on distractor cues as accuracy declines. The findings highlight a critical need for culturally robust, globally aware multimodal reasoning and offer interpretability insights and potential mitigation pathways via prompt design and token-level ablations.

Abstract

Although the cultural dimension has been one of the key aspects in evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs remains largely untested, even though this stability is crucial for supporting diverse, multicultural societies. Existing evaluations often rely on benchmarks featuring only a single cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability: accuracy drops heavily under simple image-stacking perturbations and worsens further under the image-generation-based variant. Interpretability analyses further show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual cultural concept mixing can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.

Paper Structure

This paper contains 37 sections, 7 equations, 12 figures, 5 tables, 2 algorithms.

Figures (12)

  • Figure 1: Our ConfusedTourist construction pipeline. The pipeline consists of 3 stages: (1) Context Crawling to obtain balanced, culturally diverse item data and descriptions; (2) Pair & image creation, where we generate hard and easy cultural pairings and produce various perturbation-infused visual cases; and (3) Evaluation, where we assess VLMs' concept grounding ability using objective metrics and interpretability analysis.
  • Figure 2: Overall evaluation results in average accuracy of country & cultural item prediction. Key trends: (a) Proprietary VLMs outperform open-weight variants, with generative perturbations being more adverse (especially with flags). (b) Predicting the cultural item name is more challenging even in the baseline case, though country accuracy drops are much larger at both difficulty levels. (c) Both pairing methods yield similar average performance.
  • Figure 3: The negative correlation between country prediction accuracy and the model's distraction likelihood in wrongly predicted cases. (a) The proportion of wrongly predicted countries across models increases as country prediction accuracy decreases. (b) Across 11 different subregions for each VLM, the correlation between the two metrics is $-0.76$, suggesting a strong negative relationship.
  • Figure 4: GPT-5 (High$^+$) results. (a) Global map of accuracy drop, computed as the ratio between performance difference and original score. (b) Distribution of predicted countries where the model is incorrect but does not follow the adversarial country.
  • Figure 5: Attention heatmap analysis indicates that visual grounding primarily arises from a limited set of tokens. In (a) attire, (b) cuisine, and (c) musical-instrument culture items, tokens linked to system cues, geographic references, and category-specific terms dominate the model’s visual attention.
  • ...and 7 more figures
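
The two quantities named in the captions of Figures 3 and 4 can be illustrated with a minimal Python sketch. It assumes that the accuracy drop in Figure 4(a) is (original - perturbed) / original, and that "distraction likelihood" means the share of wrong predictions that match the adversarial country. Both readings are inferred from the caption wording and are not confirmed by the paper's text. The function names and sample data are hypothetical.

```python
from typing import Sequence


def relative_accuracy_drop(original_acc: float, perturbed_acc: float) -> float:
    """Ratio between the performance difference and the original score.

    Assumed reading of the Figure 4(a) caption: (original - perturbed) / original.
    """
    if original_acc <= 0:
        raise ValueError("original accuracy must be positive")
    return (original_acc - perturbed_acc) / original_acc


def distraction_rate(
    predictions: Sequence[str],
    gold: Sequence[str],
    adversarial: Sequence[str],
) -> float:
    """Among wrong predictions, the fraction equal to the adversarial country.

    Assumed definition of "distraction likelihood"; returns 0.0 when every
    prediction is correct.
    """
    if not (len(predictions) == len(gold) == len(adversarial)):
        raise ValueError("all inputs must have the same length")
    # Keep only the cases where the model missed the gold country.
    wrong = [(p, a) for p, g, a in zip(predictions, gold, adversarial) if p != g]
    if not wrong:
        return 0.0
    return sum(p == a for p, a in wrong) / len(wrong)


# Hypothetical toy data: two of four predictions are wrong,
# and one of those two follows the adversarial cue.
preds = ["Japan", "Italy", "Peru", "Japan"]
gold = ["Japan", "India", "Peru", "Kenya"]
adv = ["Italy", "Italy", "Chile", "Brazil"]

drop = relative_accuracy_drop(original_acc=0.8, perturbed_acc=0.6)
rate = distraction_rate(preds, gold, adv)
print(f"relative drop: {drop:.2f}, distraction rate: {rate:.2f}")
```

Normalizing the drop by the original score keeps models with different baseline accuracies comparable, which is presumably why the caption describes a ratio and not a raw difference.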