CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts

Malvina Nikandrou, Georgios Pantazopoulos, Nikolas Vitsakis, Ioannis Konstas, Alessandro Suglia

TL;DR

CROPE introduces a culture-focused probing benchmark for Vision-Language Models (VLMs) that assesses both recognition of culture-specific concepts and in-context adaptation to them. It disentangles parametric knowledge from contextual knowledge by testing zero-shot performance and four contextual conditions (textual, visual, and multimodal), using a large set of hard negatives derived from cultural concepts. Across open VLMs, CROPE reveals large gaps between culture-specific and common concepts in the zero-shot setting and finds that contextual information often fails to improve performance and can even degrade it, whereas humans benefit from multimodal context. The work highlights significant limitations in current cultural understanding and multimodal reasoning, advocates for more inclusive VLMs, and provides a benchmark to drive progress on in-context cultural adaptation.

Abstract

As Vision and Language models (VLMs) are reaching users across the globe, assessing their cultural understanding has become a critical challenge. In this paper, we introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts and evaluate the capacity for cultural adaptation through contextual information. This allows us to distinguish between parametric knowledge acquired during training and contextual knowledge provided during inference via visual and textual descriptions. Our evaluation of several state-of-the-art open VLMs shows large performance disparities between culture-specific and common concepts in the parametric setting. Moreover, experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions. Our findings reveal limitations in the cultural understanding and adaptability of current VLMs that need to be addressed toward more culturally inclusive models.

Paper Structure

This paper contains 44 sections, 13 figures, 10 tables.

Figures (13)

  • Figure 1: CROPE probes the cultural knowledge of VLMs and assesses the effect of contextual information. Each dataset sample poses a question about the presence of a culture-specific concept within an image and is paired with demonstrative text and images that can be used as additional context to improve understanding.
  • Figure 2: Overview of the dataset creation methodology. We start from a collection of concepts from geographically diverse languages. We collect a pool of challenging negative candidates from Wikidata and by prompting an LLM. Then, we use a VLM to rank candidates and sample up to three candidates per image. To verify each example, we ask human annotators who are proficient in the original concept language and English to annotate the images. Finally, we aggregate the labels and filter out ambiguous examples.
  • Figure 3: Zero-shot F1 score per source language.
  • Figure 4: Performance with different context types. All VLMs are negatively impacted when the concept summary is included in the question (Textual Context). Of the 7 VLMs that accept multimodal context, only XGEN-MM and InternLM-XComposer benefit from multimodal contextual information.
  • Figure 5: Relative performance of Zero-shot vs Textual conditions for the original (top) and easy (bottom) versions of CROPE. Textual summaries benefit most models when differentiating between easier candidates.
  • ...and 8 more figures
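
The Figure 2 caption mentions aggregating annotator labels and filtering out ambiguous examples, but this page does not state the rule used. The following is a minimal sketch under the assumption of a strict majority vote over per-example "yes"/"no" annotations, with ties and empty annotation lists treated as ambiguous. The function names, the `annotations` and `label` keys, and the majority rule itself are hypothetical and are not taken from the paper.

```python
from collections import Counter


def aggregate_labels(annotations):
    """Return the strict-majority label, or None if the example is ambiguous.

    Assumption (not from the paper): an example is ambiguous when no label
    is chosen by more than half of the annotators.
    """
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    # Strict majority: the winning label needs more than half of the votes.
    if votes * 2 <= len(annotations):
        return None
    return label


def filter_examples(examples):
    """Keep only examples with an unambiguous aggregated label.

    Each example is a dict with a hypothetical "annotations" list; kept
    examples gain a "label" field holding the aggregated answer.
    """
    kept = []
    for example in examples:
        label = aggregate_labels(example["annotations"])
        if label is not None:
            kept.append({**example, "label": label})
    return kept


if __name__ == "__main__":
    samples = [
        {"id": 1, "annotations": ["yes", "yes", "no"]},
        {"id": 2, "annotations": ["yes", "no"]},  # tie, dropped as ambiguous
        {"id": 3, "annotations": ["no", "no", "no"]},
    ]
    for example in filter_examples(samples):
        print(example["id"], example["label"])
```

A stricter agreement threshold (for example, unanimous agreement) could be substituted in `aggregate_labels` without changing the filtering loop.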