Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis
Anusha Kamath, Kanishk Singla, Rakesh Paul, Raviraj Joshi, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar
TL;DR
This work addresses the lack of high-quality Hindi evaluation benchmarks for instruction-tuned LLMs by introducing a five-dataset suite (IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, BFCL-Hi) created via a translate-and-verify, human-in-the-loop process. The authors benchmark open-source Hindi-supporting LLMs across these datasets, revealing that architecture and targeted language training drive performance more than sheer size, with different models excelling at different tasks. The proposed hybrid curation methodology provides a replicable path for developing culturally and linguistically appropriate benchmarks in other low-resource languages. The results offer actionable insights for improving Hindi LLMs in instruction-following, multi-turn dialogue, and function-calling scenarios, advancing more equitable multilingual AI systems.
Abstract
Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We use this suite to extensively benchmark open-source LLMs that support Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
