Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal

TL;DR

This work tackles the scarcity of culturally grounded, multilingual post-training data for Indian languages by introducing a scalable human-in-the-loop pipeline that couples translation with synthetic expansion. It yields two datasets, Pragyaan-IT (22.5K) for instruction tuning and Pragyaan-Align (100K) for preference-based alignment, spanning 10 Indian languages and 56 sub-categories across 13 broad categories. The approach emphasizes task diversity, multi-turn dialogues, instruction fidelity, safety, and Indian cultural context, addressing limitations of direct translations and purely synthetic data. A pilot downstream evaluation on the Updesh dataset demonstrates promising alignment potential across languages and categories. The workflow is adaptable to other multilingual settings and provides a detailed blueprint for culturally inclusive post-training data curation.

Abstract

The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and they suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets, Pragyaan-IT (22.5K) and Pragyaan-Align (100K), across 10 Indian languages, covering 13 broad categories and 56 sub-categories and drawing on 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.

Paper Structure

This paper contains 32 sections, 30 figures, 6 tables.

Figures (30)

  • Figure 1: Workflow for building Indian language post-training data: English prompts are either translated or expanded via a modified self-instruct pipeline to generate synthetic prompts. In both cases, responses are then produced with an LLM, translated into one of the 10 Indian languages, and manually refined (see the method section).
  • Figure 2: Distribution of Pragyaan-IT (Instruction-Tuning) data across languages (left) and categories (right).
  • Figure 3: Average word counts of Pragyaan-Align alignment data across languages (left) and categories (right).
  • Figure 4: Average word counts of Pragyaan-IT data across languages (left) and categories (right).
  • Figure 5: Win rates of the Krutrim-2-12B (left) and Llama-3-8B (right) models after DPO, compared against their respective pre-DPO versions.
  • ...and 25 more figures