IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages

Harman Singh; Nitish Gupta; Shikhar Bharadwaj; Dinesh Tewari; Partha Talukdar

TL;DR

IndicGenBench is the largest generation-oriented benchmark for 29 Indic languages, extending existing resources with human-translated, multi-task data to evaluate cross-lingual summarization, translation, and QA. The study comprehensively benchmarks diverse LLMs (including PaLM-2, GPT-4, LLaMA, Gemma, BLOOMZ) across one-shot, few-shot, and fine-tuning regimes, revealing a substantial English–Indic performance gap and strong model-size effects. Key findings show that prompting language, resource level of the target language, and tokenizer properties significantly impact generation quality, with Hindi often serving as an effective transfer language. The work also provides qualitative analyses and highlights tokenization-related bottlenecks, suggesting directions for improved multilingual generation and evaluation for Indic languages.

Abstract

As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench, the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks such as cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation, providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs, including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and LLaMA, on IndicGenBench in a variety of settings. The largest PaLM-2 models perform the best on most tasks; however, there is a significant performance gap in all languages compared to English, showing that further research is needed for the development of more inclusive multilingual language models. IndicGenBench is released at www.github.com/google-research-datasets/indic-gen-bench

Paper Structure (27 sections, 5 figures, 22 tables)

Introduction
IndicGenBench
Experiments and Analysis
Evaluation Metrics
Comparison of LLMs on IndicGenBench
Performance across language categories
In-context learning on IndicGenBench
Transfer from high-resource languages
Fine-tuning LLMs on IndicGenBench and Comparison with In-Context Learning
Analyzing Tokenizer across Indic languages
Qualitative Analysis
Generation in a related language
Hallucination and Missing Information
Related Work
Conclusion
...and 12 more sections

Figures (5)

Figure 1: Performance of state-of-the-art LLMs on different tasks in IndicGenBench. We observe a significant performance gap between English and Indic languages across LLMs.
Figure 2: Tokenizer fertility for different languages using OpenAI's Byte Pair Encoding. We note that mid- and low-resource languages suffer from high token fertility. (See the section "Analyzing Tokenizer across Indic languages".)
Figure 3: Percentage of the XQuAD-In test set in the few-shot learning setting that fits in a 1920-token context. The high token fertility of mid- to low-resource languages means that far fewer in-context examples fit compared to higher-resourced languages. (See the section "Analyzing Tokenizer across Indic languages".)
Figure 4: Example predictions from PaLM-2-L model on the CrossSum-In task highlighting issues with hallucinations in model predictions.
Figure 5: Example predictions from PaLM-2-L model on the Flores-In task highlighting issues of (a) producing words in the wrong (higher-resourced) language or words with the wrong inflection, and (b) outputting the incorrect translations for polysemous words in English or missing crucial information from the generated translation.
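
The captions for Figures 2 and 3 rely on tokenizer fertility, read here as the average number of tokens per word, and on how it limits the number of in-context examples that fit in a 1920-token context. The sketch below illustrates both ideas under stated assumptions: `byte_tokenize` is a hypothetical stand-in that emits one token per UTF-8 byte (it is not OpenAI's Byte Pair Encoding, and not the paper's setup), words are split on whitespace, and the helper names are illustrative only.

```python
from typing import Callable, List, Sequence


def byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def fertility(text: str, tokenize: Callable[[str], Sequence] = byte_tokenize) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def examples_that_fit(example_lengths: Sequence[int],
                      query_length: int,
                      budget: int = 1920) -> int:
    """Count the in-context examples that fit, in order, within the token budget."""
    remaining = budget - query_length  # tokens left once the query is reserved
    count = 0
    for n in example_lengths:
        if n > remaining:
            break  # the next example would overflow the context
        remaining -= n
        count += 1
    return count


if __name__ == "__main__":
    english = "hello world"
    hindi = "नमस्ते दुनिया"
    # Devanagari characters take 3 bytes each in UTF-8, so the byte-level
    # stand-in assigns the Hindi text a higher fertility than the English text.
    print(fertility(english), fertility(hindi))
    # Longer examples (in tokens) leave room for fewer shots in the same budget.
    print(examples_that_fit([500] * 4, query_length=100))
    print(examples_that_fit([1000] * 4, query_length=100))
```

Passing a real tokenizer's encode function as `tokenize` would give fertility values for that tokenizer; the byte-level stand-in only shows the direction of the effect.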
