GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking

Florian Schneider, Carolin Holtermann, Chris Biemann, Anne Lauscher

TL;DR

GIMMICK is a globally inclusive multimodal benchmark of cultural knowledge for LVLMs and LLMs. It is built on UNESCO's Intangible Cultural Heritage data and comprises six tasks across three datasets. It spans 728 cultural events in 144 countries and six macro-regions, covers 57 cultural aspects, and is used to assess 31 models (proprietary and open-weight) with respect to regional biases, model-size effects, multimodal input, and external country cues. The study reports a pronounced Western bias across models and tasks, finds that larger models generally perform better and reduce some biases, and shows that multimodal inputs plus country-level cues improve performance, particularly for underrepresented regions. While models capture broad cultural categories, they struggle with nuanced, intangible cultural knowledge, which underscores the need for more globally inclusive AI and motivates future work.

Abstract

Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.

Paper Structure

This paper contains 55 sections, 1 equation, 147 figures, and 18 tables.

Figures (147)

  • Figure 1: An overview of the GIMMICK benchmark and its tasks.
  • Figure 2: Aggregated results of the VQA tasks.
  • Figure 3: CIVQA ground-truth answer perplexity.
  • Figure 4: Model size vs. performance on GIMMICK tasks. The x-axis is in log scale. The trend line was computed using OLS regression. We report the Pearson correlation coefficient $r$ (* indicates statistical significance).
  • Figure 5: Relative Difference to W for CIVQA.
  • ...and 142 more figures