CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation

Arnav Yayavaram, Siddharth Yayavaram, Simran Khanuja, Michael Saxon, Graham Neubig

TL;DR

CAIRe introduces a retrieval-augmented, knowledge-grounded framework for visual cultural attribution that scores how well an image aligns with user-defined culture labels on a 1–5 scale. It grounds visual content to BabelNet-derived entities via Visual Entity Linking and then uses vision-language models with retrieved Wikipedia context to produce culture-specific relevance judgments, enabling fine-grained, multi-label assessment. Two test sets—specific (rare, culturally salient concepts) and universal (culturally universal concepts, with generated and natural images and human judgments)—demonstrate CAIRe's strong alignment with human opinions (Pearson correlations up to 0.66) and substantial improvements over baselines (up to ~25 F1 points on the specific set). Beyond per-image attribution, CAIRe serves as a batch-level diagnostic for cultural skews in T2I outputs, highlighting its practical utility for evaluating and auditing cross-cultural fairness in vision-language systems. The framework is modular and extensible to different KBs and VLMs, offering a scalable tool to quantify and analyze cultural relevance across diverse image sources and cultures.
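The two stages described above (retrieve knowledge-base entities for an image, then judge each culture label independently on a 1-5 scale) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy embeddings, the placeholder article text, and `toy_judge` (a keyword-matching stand-in for a real vision-language model) are all assumptions introduced here.

```python
import numpy as np

def retrieve_entities(image_vec, kb_vecs, kb_names, k=2):
    """Return the k knowledge-base entities closest to the image by cosine similarity."""
    q = image_vec / np.linalg.norm(image_vec)
    kb = kb_vecs / np.linalg.norm(kb_vecs, axis=1, keepdims=True)
    sims = kb @ q
    top = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(kb_names[i], float(sims[i])) for i in top]

def score_labels(context, labels, judge):
    """Score every culture label independently, clamped to the 1-5 scale."""
    scores = {}
    for label in labels:
        raw = judge(context, label)
        scores[label] = int(min(5, max(1, round(raw))))
    return scores

def toy_judge(context, label):
    # Hypothetical stand-in for a VLM call: 5 if the label is named in the
    # retrieved text, otherwise 1. A real judge would also see the image.
    return 5 if label.lower() in context.lower() else 1

# Toy knowledge base (assumed data): three entities with made-up embeddings
# and placeholder article text.
kb_names = ["djembe", "azulejo", "sitar"]
kb_vecs = np.eye(3)
kb_text = {
    "djembe": "Placeholder text: a drum associated with Mali and Guinea.",
    "azulejo": "Placeholder text: a tile style associated with Portugal and Spain.",
    "sitar": "Placeholder text: a string instrument associated with India.",
}

image_vec = np.array([0.1, 0.9, 0.0])  # made-up image embedding
hits = retrieve_entities(image_vec, kb_vecs, kb_names, k=1)
context = " ".join(kb_text[name] for name, _ in hits)
scores = score_labels(context, ["Portugal", "Spain", "Morocco"], toy_judge)
print(hits[0][0], scores)
```

Because each label is scored on its own rather than through a softmax over labels, an image can receive a high score for several cultures at once, which is what makes the multi-label assessment possible.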

Abstract

As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered by trade-offs, including a loss in performance, factual inaccuracies, or offensive outputs. Despite widespread recognition of these challenges, an inability to reliably measure these biases has stalled progress. To address this gap, we introduce CAIRe, an evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Our framework grounds entities and concepts in the image to a knowledge base and uses factual information to give independent graded judgments for each culture label. On a manually curated dataset of culturally salient but rare items built using language models, CAIRe surpasses all baselines by 22% F1 points. Additionally, we construct two datasets for culturally universal concepts, one comprising T2I-generated outputs and another retrieved from naturally occurring data. CAIRe achieves Pearson's correlations of 0.56 and 0.66 with human ratings on these sets, based on a 5-point Likert scale of cultural relevance. This demonstrates its strong alignment with human judgment across diverse image sources.
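For readers unfamiliar with the agreement measure, the sketch below shows how a Pearson correlation between metric scores and human Likert ratings could be computed. The two score lists are invented for illustration and are not data from the paper; `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Made-up 1-5 scores for five images (illustrative only).
metric_scores = [1, 2, 3, 4, 5]
human_ratings = [2, 1, 4, 3, 5]

r = pearson_r(metric_scores, human_ratings)
print(f"Pearson r = {r:.2f}")
```

A value near 1 means the metric ranks and spaces images much as the human raters do; a value near 0 means there is no linear relationship.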

Paper Structure

This paper contains 49 sections, 6 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: We propose CAIRe, a novel evaluation metric that assesses the degree of cultural relevance of an image, given a user-defined set of labels. Unlike existing methods that assume a definition of culture, we let the user specify cultures to assess as free-text labels (photo is of a djembe).
  • Figure 2: Overview of CAIRe. From an image-indexed multimodal knowledge base, we embed an input image to retrieve entities that are tied to Wikipedia articles. From the text of those Wikipedia articles and the query image, a vision-language model (VLM) generates an affinity score for each user-specified candidate culture label. A detailed description of our framework is given in the metric design section.
  • Figure 3: Examples from the evaluation set: (a) image from the specific set depicting Azulejos. The label set of countries consists of Portugal, Spain, Brazil, Morocco, and Mexico. (b) T2I-generated image using the prompt "A realistic photo of a ritual in India", representing the universal-generated subset. (c) image retrieved from DataComp 1B using the text query "A realistic photo of greetings in Nigeria", illustrating the universal-retrieved subset. (d-f) additional examples from the specific set corresponding to the cultural proxies religion, Bronze Age civilizations, and cities of Indonesia (Buddhism, Sumer, and Magelang, respectively).
  • Figure 4: CAIRe's use in evaluating diversity across T2I-generated outputs (prompt: a photo of a wedding).
  • Figure 5: Retrieval Comparison Across Encoders
  • ...and 4 more figures