UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic

Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram

TL;DR

This work tackles the scarcity of culturally grounded multilingual AI for Indian languages by introducing Updesh, a synthetic instruction-following dataset of 9.5M data points spanning 13 Indic languages plus English. Built with a bottom-up approach grounded in language-specific Wikipedia content, Updesh complements traditional translation-first pipelines and comprises two subsets: Reasoning (translated from OrcaAgent-Instruct and OrcaMath) and Generative (culturally grounded synthesis from Wikipedia). Comprehensive automated and human evaluations, along with ablations and cultural assessments, demonstrate that models fine-tuned on Updesh achieve superior NLG and competitive NLU across languages, with robust cross-lingual transfer to unseen languages. The authors provide extensive reproducibility artifacts, including code and pipelines, and offer design principles for future multilingual and multicultural data generation. Overall, Updesh advances practical multilingual AI by demonstrating the value of culturally aware, context-rich synthetic data for instruction-following models in underrepresented languages.

Abstract

Developing culturally grounded multilingual AI systems remains challenging, particularly for low-resource languages. While synthetic data offers promise, its effectiveness in multilingual and multicultural contexts is underexplored. We investigate bottom-up synthetic data generation using large open-source LLMs (>= 235B parameters) grounded in language-specific Wikipedia content, complementing dominant top-down translation-based approaches from English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages and English, encompassing diverse reasoning and generative tasks. Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality. Downstream evaluations, performed by fine-tuning models on various datasets and assessing them across 13 diverse multilingual datasets and in comparative model evaluations, demonstrate that models trained on Updesh consistently obtain significant improvements on NLU and NLG evaluations. Finally, through ablation studies and cultural evaluations, we show that context-aware, culturally grounded data generation is essential for effective multilingual AI development.

Paper Structure

This paper contains 53 sections, 13 figures, 15 tables.

Figures (13)

  • Figure 1: Overview of the data generation pipeline for the Updesh Dataset.
  • Figure 2: Human LLM-judge agreement across evaluation metrics, revealing differences across dimensions.
  • Figure 3: Evaluation plots for models fine-tuned on Updesh vs. existing datasets.
  • Figure 4: Model Performance Landscape: NLU vs. NLG vs. Win Counts. The horizontal axis represents the average NLU score (accuracy between 0-100), while the vertical axis represents the average NLG score (ChrF between 0-100). The size of each bubble corresponds to the number of specific datasets (12 tasks evaluated) where the model outperformed all others. The Updesh model (green) demonstrates the most dominant position, with high generation scores and the largest number of task wins across both Llama and Phi settings.
  • Figure 5: NLG performance across 16 out-of-domain Indic languages on Flores. Updesh (red) achieves the highest average scores on both Llama-3-8B and Phi-4 architectures, outperforming standard baselines (Zero-shot) and comparable instruction-tuned models (Bactrian, Aya, IndicAlign).
  • ...and 8 more figures