RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie

TL;DR

RAVENEA addresses the gap in visual culture understanding for vision-language models (VLMs) by introducing a retrieval-augmented benchmark that links culturally grounded images with externally sourced knowledge. It combines CVQA and CCUB into a large, human-annotated dataset using a three-stage construction pipeline and evaluates seven culture-aware retrievers across 14 VLMs, showing that retrieval augmentation improves performance, especially for lightweight models. The paper introduces Culture-Aware Contrastive (CAC) learning to train retrievers (CaCLIP, CaSigLIP2) and shows that CaCLIP achieves the best overall gains, including higher cVQA accuracy and cIC RegionScore, while highlighting cross-country biases and the limits of model scaling. Overall, RAVENEA demonstrates the value of retrieval-augmented multimodal methods for culturally grounded understanding and outlines directions for broader coverage, richer metrics, and further analysis of cultural biases in multimodal models.

Abstract

As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.
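
The retrieve-then-augment flow described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy embedding vectors, the document records, and the helper names (`cosine`, `retrieve_top_k`, `build_prompt`) are all hypothetical, and a real system would obtain the embeddings from a multimodal retriever and pass the resulting prompt, together with the image, to a VLM.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_top_k(image_vec, docs, k=1):
    """Rank candidate documents by similarity to the image embedding."""
    ranked = sorted(docs, key=lambda d: cosine(image_vec, d["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Prepend the retrieved documents to the question as context."""
    context = "\n".join(f"[{d['title']}] {d['text']}" for d in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy stand-ins (assumptions): in practice these vectors would come from
# the image and text encoders of a multimodal retriever.
docs = [
    {"title": "Doc A", "text": "Placeholder text about a festival.", "vec": [0.9, 0.1, 0.0]},
    {"title": "Doc B", "text": "Placeholder text about a dish.", "vec": [0.1, 0.9, 0.2]},
    {"title": "Doc C", "text": "Placeholder text about a building.", "vec": [0.0, 0.2, 0.9]},
]
image_vec = [0.2, 0.8, 0.1]

top = retrieve_top_k(image_vec, docs, k=2)
prompt = build_prompt("What dish is shown in the image?", top)
print(prompt)
```

With these toy vectors, "Doc B" ranks first and "Doc A" second, so both appear in the context block ahead of the question. The non-augmented baseline corresponds to sending the question without the context block.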

Paper Structure

This paper contains 34 sections, 6 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Effectiveness of culture-aware RAG. Given a culturally grounded visual question, VLMs enhanced with culture-aware RAG, which retrieves relevant Wikipedia documents, generate more accurate answers than their non-RAG counterparts (see the overall results section).
  • Figure 2: RAVENEA: A Multimodal Retrieval-Augmented Visual culturE uNdErstAnding dataset. Left: Examples of cVQA and cIC tasks. Middle: Geographic and categorical distribution of cultural references. Right: Performance comparison of 14 VLMs, evaluated with and without integration of our culture-aware retriever. Here, CaCLIP denotes "culture-aware CLIP-L/14@224px".
  • Figure 3: RAVENEA construction pipeline. Left: A two-stage retrieval process to match each image with relevant documents. Middle: Decomposition of cultural relevance into three interpretable dimensions to improve human annotation. Right: Postprocessing methods for quality control.
  • Figure 4: Examples demonstrating the impact of CaCLIP Wikipedia retrieval integration on cVQA and cIC tasks using DeepseekVL2-Tiny. When augmented with culture-aware retrieval, the model exhibits enhanced sensitivity to cultural context.
  • Figure 5: Performance improvements for the smallest and largest models in each family with multimodal retrievers. Scaling up the model yields only marginal gains across retrievers, and in some cases negative effects, on both cVQA and cIC. "ACC." denotes accuracy; "R.S." denotes RegionScore; "$\Delta$" is the change with RAG relative to the non-RAG baseline.
  • ...and 10 more figures