IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal

TL;DR

IndicVisionBench addresses the underrepresentation of India's cultural and linguistic diversity in vision-language model (VLM) evaluation with a large-scale, India-centric benchmark. It covers English and 10 Indic languages across three tasks (OCR, MMT, and VQA), with ~5K images and 37K+ QA pairs across 13 topics, plus a parallel cross-lingual annotation corpus. The authors evaluate 8 models, both proprietary and open-weight, and find substantial performance gaps, especially for low-resource languages and culturally grounded content, with clear differences between closed and open models. The benchmark provides a reproducible framework for studying cultural bias and multilingual understanding in VLMs, supporting more inclusive multimodal AI research.

Abstract

Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.

Paper Structure

This paper contains 31 sections, 22 figures, 16 tables.

Figures (22)

  • Figure 1: IndicVisionBench (IVB) pipeline and 3 tracks. Top panel illustrates our image collection pipeline for 10 Indian languages, showing the number of images at each step, with human quality checks applied throughout. We also present sample outputs for the three tracks: VQA (Visual Question Answering) in English, MMT (Multimodal Machine Translation) in Telugu, and OCR (Optical Character Recognition) in Punjabi. Further details are provided in Section \ref{sec:benchmark}.
  • Figure 2: Examples from IndicVisionBench-VQA. Illustrative samples from different regions are shown on the left. The map on the right depicts the regional distribution of images across India, with counts per State/UT. Further details are provided in Section \ref{appendix:dataset} of the Appendix.
  • Figure 3: Data analysis on IndicVisionBench. Distribution of VQA questions by category (a) and by language, excluding English (b); average word counts for questions (c) and answers (d); caption word counts in Hindi for MMT (e); and average words per language for OCR (f).
  • Figure 4: Model performances on IndicVisionBench-VQA-Parallel. Average scores across languages for the three open-ended (long and short) questions (left) and scores across languages for the structured tasks, True/False and MCQ (right).
  • Figure 5: Performance across topics in IndicVisionBench-VQA. Distribution of question categories (left) and model performance averaged over the two short-answer and one long-answer open-ended questions (right). Gemini-2.5 shows comparable performance across all topics.
  • ...and 17 more figures