IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages

Thanmay Jayakumar; Mohammed Safi Ur Rahman Khan; Raj Dabre; Ratish Puduppully; Anoop Kunchukuttan

TL;DR

This work introduces IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions, and conducts a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models.

Abstract

Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language spread across two complementary subsets: IndicIFEval-Trans, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval-Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks, and despite progress in high-resource languages, instruction-following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (http://github.com/ai4bharat/IndicIFEval).
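To make "automatically verifiable, rule-based instructions" concrete, the following is a minimal sketch of what such checks can look like. It is not the benchmark's released evaluation code: the function names, the substring-based keyword test, and the Devanagari-share heuristic for a language constraint are all assumptions chosen for illustration.

```python
import re

# Hypothetical checkers illustrating rule-based verification.
# These are illustrative assumptions, not IndicIFEval's actual scripts.

def has_min_words(response: str, n: int) -> bool:
    """Length constraint: at least n whitespace-separated tokens."""
    return len(response.split()) >= n

def has_bullet_count(response: str, n: int) -> bool:
    """Formatting constraint: exactly n lines that start with '* ' or '- '."""
    bullets = [ln for ln in response.splitlines()
               if re.match(r"^\s*[*-]\s+", ln)]
    return len(bullets) == n

def contains_keywords(response: str, keywords: list[str]) -> bool:
    """Lexical constraint: every keyword occurs as a substring."""
    return all(k in response for k in keywords)

def mostly_devanagari(response: str, threshold: float = 0.5) -> bool:
    """Language constraint (rough proxy): share of alphabetic characters
    that fall in the Devanagari block, U+0900 to U+097F."""
    letters = [c for c in response if c.isalpha()]
    if not letters:
        return False
    in_block = sum(1 for c in letters if "\u0900" <= c <= "\u097f")
    return in_block / len(letters) >= threshold

response = "- नमस्ते दुनिया\n- भारत एक देश है"
results = {
    "min_words": has_min_words(response, 5),
    "bullets": has_bullet_count(response, 2),
    "keywords": contains_keywords(response, ["भारत"]),
    "language": mostly_devanagari(response),
}
print(results)
```

Each check returns a boolean without a human or model judge, which is what makes instructions of this kind automatically verifiable. A script-range heuristic like `mostly_devanagari` cannot separate languages that share a script (for example Hindi and Marathi), so a real language check would need a proper language identifier.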

Paper Structure (33 sections, 14 figures, 9 tables)

Introduction
Related Work
Dataset Construction
IndicIFEval-Trans
Preprocessing
Keyword Extraction + Individual Keyword Translation
Pre-translation Insertion
Full Translation
Automatic Verification
IndicIFEval-Ground
Human Verification
Evaluation
Models
Metrics
Evaluation
...and 18 more sections

Figures (14)

Figure 1: Overview of the dataset construction pipeline
Figure 2: Effect of increasing model parameters (higher is better). The Gemma-3-27B-IT model performs the best, followed by Llama-4-Scout-17B-16E.
Figure 3: Indic - English disparity comparing Language vs Model Family (lower is better). The Aya family exhibits the highest gap in general, but performs moderately well for Hindi, Bengali, Urdu, and Tamil.
Figure 4: Indic - English gap comparing Model Family vs Instruction Category (lower is better). The Llama family and Gemma family perform the best and worst, respectively, on the Language constraint within the family.
Figure 5: Indic - English disparity comparing Instruction Category vs Language (lower is better). Hindi ('hi') is visually the lightest (lowest $\Delta$) across all categories horizontally and darkest (highest $\Delta$) across all languages vertically.
...and 9 more figures
