Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis

Anusha Kamath, Kanishk Singla, Rakesh Paul, Raviraj Joshi, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar

TL;DR

This work addresses the lack of high-quality Hindi evaluation benchmarks for instruction-tuned LLMs by introducing a suite of five datasets (IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, BFCL-Hi), built with a human-in-the-loop process that combines from-scratch annotation with translate-and-verify. The authors benchmark open-source LLMs that support Hindi on these datasets and find that architecture and targeted language training drive performance more than model size, with different models excelling at different tasks. The hybrid curation methodology offers a replicable path for developing culturally and linguistically appropriate benchmarks in other low-resource languages. The results give actionable insights for improving Hindi LLMs in instruction following, multi-turn dialogue, and function calling, in support of more equitable multilingual AI systems.

Abstract

Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging because high-quality benchmarks are scarce and direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We use this suite to benchmark open-source LLMs that support Hindi and provide a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.

Paper Structure

This paper contains 16 sections, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: Distribution of samples by Indian cultural themes in the IFEval-Hi dataset.
  • Figure 2: Distribution of verifiable instruction types within the IFEval-Hi dataset.
  • Figure 3: Category distribution in MT-Bench-Hi, adapted with Indian cultural themes to increase focus on culturally relevant instructions.
  • Figure 4: Representative examples from five Hindi evaluation datasets curated in this study.
  • Figure 5: A sample GSM8K question highlighting a translation mistake in Hindi (red), the corrected version (green), and the corresponding English line (yellow), showcasing the process of identifying and fixing language conversion errors manually.
  • ...and 9 more figures