Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages
Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal
TL;DR
This work tackles the scarcity of culturally grounded, multilingual post-training data for Indian languages by introducing a scalable human-in-the-loop pipeline that couples translation with synthetic expansion. It yields two datasets, Pragyaan-IT (22.5K) for instruction tuning and Pragyaan-Align (100K) for preference-based alignment, spanning 10 Indian languages and 56 sub-categories across 13 broad categories. The approach emphasizes task diversity, multi-turn dialogues, instruction fidelity, safety, and Indian cultural context, addressing limitations of direct translations and purely synthetic data. A pilot downstream evaluation on the Updesh dataset demonstrates promising alignment potential across languages and categories. The workflow is adaptable to other multilingual settings and provides a detailed blueprint for culturally inclusive post-training data curation.
Abstract
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from gaps in task diversity that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets, Pragyaan-IT (22.5K) and Pragyaan-Align (100K), across 10 Indian languages, covering 13 broad categories and 56 sub-categories and drawing on 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
