L3Cube-IndicQuest: A Benchmark Question Answering Dataset for Evaluating Knowledge of LLMs in Indic Context

Pritika Rohera; Chaitrali Ginimav; Akanksha Salunke; Gayatri Sawant; Raviraj Joshi

TL;DR

This paper presents L3Cube-IndicQuest, a gold-standard factual question-answering benchmark dataset designed to evaluate how well multilingual LLMs capture regional knowledge across various Indic languages.

Abstract

Large Language Models (LLMs) have made significant progress in incorporating Indic languages within multilingual models. However, it is crucial to quantitatively assess whether performance in these languages is comparable to that in globally dominant ones, such as English. Currently, there is a lack of benchmark datasets specifically designed to evaluate the regional knowledge of LLMs in various Indic languages. In this paper, we present L3Cube-IndicQuest, a gold-standard factual question-answering benchmark dataset designed to evaluate how well multilingual LLMs capture regional knowledge across various Indic languages. The dataset contains 200 question-answer pairs for each of English and 19 Indic languages, covering five domains specific to the Indic region. We intend this dataset to serve as a benchmark, providing ground truth for evaluating how well LLMs understand and represent knowledge relevant to the Indian context. IndicQuest can be used for both reference-based evaluation and LLM-as-a-judge evaluation. The dataset is shared publicly at https://github.com/l3cube-pune/indic-nlp .

Paper Structure (10 sections, 5 figures, 1 table)

This paper contains 10 sections, 5 figures, 1 table.

Introduction
Related Work
Dataset Curation
Dataset Preparations
Data Statistics
Evaluation Methodology
Evaluation Metrics
Results and Observations
Conclusion and Future Work
Acknowledgements

Figures (5)

Figure 1: Language ranking based on average 'Overall' IndicQuest scores (Llama-3.1-405B-Instruct as a Judge) across languages, aggregating the scores for responses by the models. This ranking highlights the quality of multilingual LLMs for different Indic languages.
Figure 2: Dataset Curation Process
Figure 3: Dataset Overview
Figure 4: Average F1 scores across models, obtained by aggregating the scores for all responses these models gave to the questions in IndicQuest. This ranking highlights model performance for Indic languages.
Figure 5: Average 'Overall' scores across Domains obtained by aggregating the scores for responses of all languages and models for the domain. This indicates model performance across various domains.
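
The abstract mentions reference-based evaluation, and Figure 4 reports average F1 scores, but this page does not say how F1 is computed. As a rough illustration only, the sketch below assumes a token-overlap F1 between a model answer and the reference answer, using plain whitespace tokenization. The function names `token_f1` and `average_f1` are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer.

    Assumption: whitespace tokenization, no normalization. The paper's
    actual F1 definition may differ.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def average_f1(pairs):
    """Mean token F1 over (prediction, reference) pairs; 0.0 if empty."""
    scores = [token_f1(pred, ref) for pred, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, averaging `token_f1` over all answers a model gives would produce a per-model score of the kind Figure 4 aggregates. Whitespace splitting is used here because it applies uniformly to English and to space-delimited Indic scripts, though a real evaluation would likely add script-aware normalization.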
