UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic Languages
Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram
TL;DR
This work tackles the scarcity of culturally grounded multilingual AI for Indian languages by introducing Updesh, a synthetic instruction-following dataset of 9.5M data points spanning 13 Indic languages plus English. Built with a bottom-up approach grounded in language-specific Wikipedia content, Updesh complements traditional translation-first pipelines and comprises two subsets: Reasoning (translated from OrcaAgent-Instruct and OrcaMath) and Generative (culturally grounded synthesis from Wikipedia). Automated and human evaluations, along with ablations and cultural assessments, show that models fine-tuned on Updesh achieve superior NLG and competitive NLU performance across languages, with robust cross-lingual transfer to unseen languages. The authors release reproducibility artifacts, including code and pipelines, and offer design principles for future multilingual and multicultural data generation. Overall, Updesh demonstrates the value of culturally aware, context-rich synthetic data for instruction-following models in underrepresented languages.
Abstract
Developing culturally grounded multilingual AI systems remains challenging, particularly for low-resource languages. While synthetic data offers promise, its effectiveness in multilingual and multicultural contexts is underexplored. We investigate bottom-up synthetic data generation using large open-source LLMs (>= 235B parameters) grounded in language-specific Wikipedia content, complementing the dominant top-down approaches that translate from English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages and English, encompassing diverse reasoning and generative tasks. Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality. Downstream evaluations, in which we fine-tune models on various datasets and assess performance across 13 diverse multilingual datasets alongside comparative model evaluations, demonstrate that models trained on Updesh consistently obtain significant improvements on NLU and NLG evaluations. Finally, through ablation studies and cultural evaluations, we show that context-aware, culturally grounded data generation is essential for effective multilingual AI development.
