HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples

Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel, Sama Supratheek Reddy, Divyam Gupta, Dasari Srikar, Krishna Teja Kuchimanchi, Rajiv Misra, Rohun Tripathi

TL;DR

HinTel-AlignBench presents a scalable framework to evaluate multilingual vision-language models in Hindi and Telugu with English-aligned samples, addressing major gaps in prior benchmarks such as auto-translation noise, narrow task domains, and limited cultural grounding. By combining back-translation with rigorous human verification, it assembles a diverse benchmark (approximately 4,000 QA pairs per language) that includes adapted English datasets and native Indic sources (VAANI, JEE-Vision) for broad coverage. Across tested models, average performance regressed from English to Hindi by 8.3 points and to Telugu by 5.5 points, underscoring significant language-specific gaps in multimodal understanding. The work analyzes failure modes and outlines future directions to extend coverage to more Indic languages, improve cultural grounding, and refine evaluation methodologies, with implications for equitable, multilingual AI systems.

Abstract

With nearly 1.5 billion people and more than 120 major languages, India represents one of the most diverse regions in the world. As multilingual Vision-Language Models (VLMs) gain prominence, robust evaluation methodologies are essential to drive progress toward equitable AI for low-resource languages. Current multilingual VLM evaluations suffer from four major limitations: reliance on unverified auto-translations, narrow task/domain coverage, limited sample sizes, and a lack of cultural and natively sourced Question-Answering (QA) data. To address these gaps, we present a scalable framework to evaluate VLMs in Indian languages and compare their performance with that in English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. Our contributions are threefold: (1) a semi-automated dataset creation framework combining back-translation, filtering, and human verification; (2) the most comprehensive vision-language benchmark for Hindi and Telugu, including adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and novel native Indic datasets (JEE for STEM, VAANI for cultural grounding), with approximately 4,000 QA pairs per language; and (3) a detailed performance analysis of various State-of-the-Art (SOTA) open-weight and closed-source VLMs. We find that performance regresses from English to the Indian languages on 4 out of 5 tasks across all models, with an average regression of 8.3 points in Hindi and 5.5 points in Telugu. We categorize common failure modes to highlight concrete areas of improvement in multilingual multimodal understanding.
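
The abstract names back-translation and filtering as steps of the dataset creation framework but does not give their details here. The following is a minimal sketch of how such a round-trip filter could look, not the authors' implementation. The names `QAPair`, `similarity`, and `back_translation_filter`, the character-level similarity metric, and the `0.8` threshold are all assumptions made for illustration; the translators are passed in as plain callables because the actual translation systems are not specified in this excerpt.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


@dataclass
class QAPair:
    """An English question with its answer (hypothetical container)."""
    question_en: str
    answer_en: str


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1].

    This is a stand-in metric chosen for illustration, not the one
    used in the paper.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def back_translation_filter(
    pairs: List[QAPair],
    to_target: Callable[[str], str],   # English -> Hindi/Telugu (assumed)
    to_english: Callable[[str], str],  # Hindi/Telugu -> English (assumed)
    threshold: float = 0.8,            # assumed cut-off, not from the paper
) -> Tuple[List[Tuple[QAPair, str]], List[Tuple[QAPair, str]]]:
    """Split translated questions by round-trip agreement with the source.

    Returns (passed, flagged). In this sketch, `passed` items go on to
    routine human verification and `flagged` items are prioritised for
    manual correction.
    """
    passed, flagged = [], []
    for pair in pairs:
        translated = to_target(pair.question_en)
        round_trip = to_english(translated)
        if similarity(pair.question_en, round_trip) >= threshold:
            passed.append((pair, translated))
        else:
            flagged.append((pair, translated))
    return passed, flagged
```

A lexical metric like this one penalises valid paraphrases, so a real pipeline would more plausibly use a semantic comparison; the sketch only illustrates the control flow of translating, translating back, and filtering on agreement.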

Paper Structure

This paper contains 26 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Average performance of the GPT-4.1 and Gemini-2.5-Flash models in English, Hindi, and Telugu on data-parallel visual question answering samples. Overall, performance regresses from English to Hindi by 8.3 points and from English to Telugu by 5.5 points.
  • Figure 2: Dataset generation pipeline for [A] VQAv2, RealWorldQA & CLEVR-Math, [B] VAANI-H & VAANI-T, and [C] JEE-H & JEE-T. We use back-translation and text-only LLMs to reduce human involvement in QA generation.
  • Figure 3: Qualitative examples for different domains in our dataset. More images are shown in the appendix.
  • Figure 4: Average performance regression from English to the Indic language per domain, across all models. In all but one scenario, performance regresses from English to the target Indian language.
  • Figure 5: Qualitative examples from VQAv2, RealWorldQA, CLEVR-Math, VAANI-H, JEE-T, VAANI-T, and JEE-H.
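
The per-domain regression described in the Figure 4 caption can be read as the English score minus the target-language score, averaged over models. The sketch below shows that arithmetic under this reading. The function name `regression_by_domain`, the `(model, domain, language)` key layout, and the language codes are assumptions for illustration; the paper's exact aggregation may differ.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple

# (model, domain, language) -> score in points; this layout is an assumption.
Scores = Dict[Tuple[str, str, str], float]


def regression_by_domain(
    scores: Scores,
    target_lang: str,
    source_lang: str = "en",
) -> Dict[str, float]:
    """Average (source - target) score gap per domain across models.

    A positive value means models score lower in the target language
    than in the source language on that domain.
    """
    gaps = defaultdict(list)
    for (model, domain, lang), source_score in scores.items():
        if lang != source_lang:
            continue
        target_score = scores.get((model, domain, target_lang))
        # Skip model/domain pairs that have no aligned target-language score.
        if target_score is not None:
            gaps[domain].append(source_score - target_score)
    return {domain: mean(values) for domain, values in gaps.items()}
```

Because the benchmark samples are English-aligned, each model and domain pair has a score in both languages, which is what makes this paired difference meaningful.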