BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan

TL;DR

BhashaBench V1 introduces a comprehensive, bilingual benchmark tailored to India’s critical knowledge systems, spanning Agriculture, Legal, Finance, and Ayurveda with 74,166 QA pairs drawn from authentic exams. The framework combines 90+ subdomains and 500+ topics to enable fine-grained, domain-aware evaluation in English and Hindi, and is underpinned by a data-processing pipeline that includes OCR, automated extraction, and manual validation. Across 29+ LLMs, the study reveals substantial domain- and language-specific gaps: models generally score higher on English than on Hindi content, and Ayurveda poses the greatest challenges. These results highlight the need for India-centric model development and bilingual reasoning capabilities. By releasing all data, benchmarks, and tooling, the work provides a foundation for culturally aware AI systems applicable to India’s diverse linguistic and knowledge ecosystems.

Abstract

The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content than on Hindi content across all domains. Subdomain-level analysis shows that models perform relatively well in areas such as Cyber Law and International Finance, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
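
The per-domain and per-language accuracies quoted in the abstract are aggregates over question-answer pairs. The sketch below shows one way such aggregates could be computed; it is not the authors' evaluation code. The record fields (`domain`, `language`, `answer`, `prediction`), the helper `accuracy_by_group`, and the exact-match scoring rule are all assumptions made for illustration, and the toy records are invented rather than drawn from the benchmark.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Return {group_tuple: accuracy in percent}, grouping records by `keys`.

    Assumes exact-match scoring between `prediction` and `answer`
    (a hypothetical scoring rule, not confirmed by the source).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = tuple(record[k] for k in keys)
        total[group] += 1
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: 100.0 * correct[group] / total[group] for group in total}


# Toy records, invented purely for illustration (not benchmark data).
records = [
    {"domain": "Legal", "language": "English", "answer": "A", "prediction": "A"},
    {"domain": "Legal", "language": "Hindi", "answer": "B", "prediction": "C"},
    {"domain": "Ayurveda", "language": "English", "answer": "D", "prediction": "D"},
    {"domain": "Ayurveda", "language": "Hindi", "answer": "A", "prediction": "A"},
]

# Aggregate at two granularities: domain only, then domain and language.
by_domain = accuracy_by_group(records, ["domain"])
by_domain_language = accuracy_by_group(records, ["domain", "language"])

# Sanity check on the language split reported in the abstract.
english_pairs, hindi_pairs = 52_494, 21_672
total_pairs = english_pairs + hindi_pairs
```

Grouping by a tuple of keys keeps the same function usable for subdomain- or topic-level breakdowns, provided the records carry those fields.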

Paper Structure

This paper contains 34 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview diagram and statistics of BhashaBench V1.
  • Figure 2: Comparative performance of small models ($\leq$4B) over BhashaBench V1.
  • Figure 3: Comparative performance analysis of the GPT model family on BhashaBench V1.
  • Figure 4: Comparison of representative LLMs’ scores across different domains and subdomains.
  • Figure 6: Manual quality assessment of BhashaBench V1 domain questions.