IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

Ali Faraz; Akash; Shaharukh Khan; Raja Kolla; Akshat Patidar; Suranjan Goswami; Abhinav Ravi; Chandra Khatri; Shubham Agarwal

TL;DR

IndicVisionBench addresses the underrepresentation of India's cultural and linguistic diversity in vision-language model (VLM) evaluation with a large-scale, India-centric benchmark. It spans English and 10 Indic languages across three tasks, Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with ~5K images and 37K+ QA pairs across 13 topics, plus a parallel cross-lingual annotation corpus. The authors evaluate 8 models (proprietary and open-weight) and find substantial performance gaps, especially for low-resource languages and culturally grounded content, with clear differences between closed and open models. The benchmark provides a reproducible framework for studying cultural biases and multilingual understanding in VLMs, enabling more inclusive multimodal AI research.

Abstract

Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Spanning English and 10 Indian languages, our benchmark comprises 3 multimodal tasks, Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with 6 question types. Our final benchmark consists of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weight medium- and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
