Rice-VL: Evaluating Vision-Language Models for Cultural Understanding Across ASEAN Countries

Tushar Pranav, Eshan Pandey, Austria Lyka Diane Bala, Aman Chadha, Indriyati Atmosukarto, Donny Soh Cheng Lock

TL;DR

RICE-VL introduces a culturally grounded benchmark for Vision-Language Models (VLMs) across 11 ASEAN countries, with two core tasks: Cultural Visual Question Answering and Cultural Visual Grounding. The dataset comprises over 28,000 VQA items and 1,000 grounding annotations. Models are scored with the SEA-LAVE metric, which accounts for textual, cultural, and country-specific alignment. Findings show persistent Western-centric biases in state-of-the-art VLMs: closed-source models outperform open-source ones, performance varies significantly by country, and region-specific prompts improve cultural localization. The work underscores the need for regionally aware data, evaluation protocols, and model training to achieve equitable multimodal AI across diverse cultural contexts.

Abstract

Vision-Language Models (VLMs) excel in multimodal tasks but often exhibit Western-centric biases, limiting their effectiveness in culturally diverse regions like Southeast Asia (SEA). To address this, we introduce RICE-VL, a novel benchmark evaluating VLM cultural understanding across 11 ASEAN countries. RICE-VL includes over 28,000 human-curated Visual Question Answering (VQA) samples -- covering True or False, Fill-in-the-Blank, and open-ended formats -- and 1,000 image-bounding box pairs for Visual Grounding, annotated by culturally informed experts across 14 sub-ground categories. We propose SEA-LAVE, an extension of the LAVE metric, assessing textual accuracy, cultural alignment, and country identification. Evaluations of six open- and closed-source VLMs reveal significant performance gaps in low-resource countries and abstract cultural domains. The Visual Grounding task tests models' ability to localize culturally significant elements in complex scenes, probing spatial and contextual accuracy. RICE-VL exposes limitations in VLMs' cultural comprehension and highlights the need for inclusive model development to better serve diverse global populations.
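The abstract does not say how a predicted bounding box is compared with an annotated one in the Visual Grounding task. As an illustration only, the sketch below uses intersection-over-union (IoU), a common way to score box localization. The box layout (x_min, y_min, x_max, y_max), the function names, and the 0.5 threshold are assumptions made for this example, not details taken from RICE-VL.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(pred: Box, gold: Box) -> float:
    """Intersection-over-union between a predicted and an annotated box."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(pred[0], gold[0])
    iy_min = max(pred[1], gold[1])
    ix_max = min(pred[2], gold[2])
    iy_max = min(pred[3], gold[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = box_area(pred) + box_area(gold) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Box, gold: Box, threshold: float = 0.5) -> bool:
    """Hypothetical hit rule: a prediction counts if IoU reaches the threshold."""
    return iou(pred, gold) >= threshold


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    annotated = (1.0, 1.0, 3.0, 3.0)
    print(iou(predicted, annotated))
    print(is_hit(predicted, annotated))
```

Under such a rule, a model that finds the right object but draws a loose box would be penalized alongside one that picks the wrong object, so grounding scores of this kind mix recognition errors with localization errors.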

Paper Structure

This paper contains 39 sections, 2 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: An example instance from each task in the RICE-VL benchmark: (i) CulturalVQA; (ii) Cultural Visual Grounding.
  • Figure 2: Cultural understanding of various models on CulturalVQA tasks, when prompted with global context (left) and with SEA-specific context (right).
  • Figure 3: CulturalVQA results for cultural understanding of various models, with global and SEA-specific prompts.
  • Figure 4: Visual Grounding results (Part 1): Comparing model predictions on region-specific cultural entities.
  • Figure 5: Visual Grounding results (Part 2): Comparing model predictions on region-specific cultural entities.
  • ...and 1 more figures