Airavata: Introducing Hindi Instruction-tuned LLM

Jay Gala; Thanmay Jayakumar; Jaavid Aktar Husain; Aswanth Kumar M; Mohammed Safi Ur Rahman Khan; Diptesh Kanojia; Ratish Puduppully; Mitesh M. Khapra; Raj Dabre; Rudra Murthy; Anoop Kunchukuttan

TL;DR

Airavata addresses the underrepresentation of Indian languages in LLMs by releasing an open-source Hindi instruction-tuned model derived from OpenHathi. The approach translates high-quality English instruction data into Hindi with IndicTrans2 to build a sizable, filtered training set, on which the model is fine-tuned with LoRA. An evaluation framework spanning native Hindi benchmarks, translated English benchmarks, human judgments, and toxicity checks shows improvements in Hindi NLU and competitive, though uneven, Hindi NLG performance. The work releases the IndicInstruct data and evaluation resources to spur broader Indic-language LLM research, and outlines future directions toward larger datasets and cross-lingual alignment.

Abstract

We announce the initial release of "Airavata," an instruction-tuned LLM for Hindi. Airavata was created by fine-tuning OpenHathi with diverse, instruction-tuning Hindi datasets to make it better suited for assistive tasks. Along with the model, we also share the IndicInstruct dataset, which is a collection of diverse instruction-tuning datasets to enable further research for Indic LLMs. Additionally, we present evaluation benchmarks and a framework for assessing LLM performance across tasks in Hindi. Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages. You can access all artifacts at https://ai4bharat.github.io/airavata.

Paper Structure (15 sections, 1 equation, 12 figures, 9 tables)

Introduction
Instruction Tuning Dataset Creation
Supervised Fine-tuning
Full vs. LoRA finetuning
Model Selection
Evaluation on NLP Benchmarks
Results
Human Evaluation
Toxicity and Misinformation
Resources
Summary and Future Outlook
Limitations
Examples
Examples where the Airavata model generates good output
Examples where Airavata output has errors

Figures (12)

Figure 1: Image Courtesy: DALL-E 3.
Figure 2: Ablation experiment to understand the performance gaps between Full fine-tuning and LoRA fine-tuning across a mix of English and Hindi NLU tasks.
Figure 3: Average satisfaction scores for various models, reported by human annotators on a Likert scale from 1 to 5.
Figure 4: Human evaluation scores for assessing the instruction following and content generation abilities of the models based on the rubrics described in the evaluation rubrics table (tab:evaluation-rubrics).
Figure 5: Fine-grained human evaluation of content generation abilities of the models described in the abilities description table (tab:abilities-desc).
...and 7 more figures
