BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Guduru Manoj, Neel Prabhanjan Rachamalla, Ashish Kulkarni, Gautam Rajeev, Jay Piplodiya, Arul Menezes, Shaharukh Khan, Souvik Rana, Manya Sah, Chandra Khatri, Shubham Agarwal

TL;DR

The paper tackles the scarcity of high-quality multilingual data for pretraining LLMs on Indic languages. It introduces BhashaKritika, a 540B-token synthetic corpus generated via five complementary techniques (document-grounded, persona-based, math/reasoning, topic-aware RAG, and translation) and filtered by a modular quality-control pipeline that includes language detection, heuristic and perplexity filters, automated quality classification, and bias mitigation. Key contributions include a detailed evaluation of generation strategies across languages and prompts, a demonstration that training on high-quality synthetic data can converge faster than training on web data and rival it on Indic benchmarks, and a pathway for applying synthetic-data methods in low-resource settings. The work offers practical guidance for building scalable, language-sensitive pretraining corpora for Indic languages, with implications for culturally inclusive NLP and future LLM development in multilingual, morphologically rich contexts.

Abstract

In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
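Two of the quality checks named above, script detection and n-gram repetition analysis, can be illustrated with a minimal Python sketch. This is not the paper's pipeline: the function names, the choice of Devanagari as the target script, and the thresholds `min_script` and `max_repeat` are all assumptions made for illustration, and the KenLM perplexity step is left out because it requires a trained language model.

```python
from collections import Counter

# Unicode range of the Devanagari block (used for Hindi, Marathi, etc.).
# Picking this single block is an illustrative assumption.
DEVANAGARI = (0x0900, 0x097F)


def script_fraction(text, block=DEVANAGARI):
    """Fraction of alphabetic characters that fall inside the given Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    lo, hi = block
    in_block = sum(1 for ch in letters if lo <= ord(ch) <= hi)
    return in_block / len(letters)


def repeated_ngram_ratio(text, n=3):
    """Share of word n-gram occurrences that belong to an n-gram seen more than once."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def passes_filters(text, min_script=0.8, max_repeat=0.3):
    """Keep a document only if it is mostly in the target script and not too repetitive.

    Both thresholds are hypothetical placeholders, not values from the paper.
    """
    return (script_fraction(text) >= min_script
            and repeated_ngram_ratio(text) <= max_repeat)
```

In practice each target language would need its own script range, and thresholds would be tuned per language against held-out data rather than fixed globally.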

Paper Structure

This paper contains 29 sections, 3 equations, 24 figures, 36 tables.

Figures (24)

  • Figure 1: Overview of synthetic data generation techniques (Section \ref{sec:syn-gen}) followed by quality evaluation (Section \ref{sec:quality-eval}). We follow 5 approaches across 10 Indian languages, using a pool of multilingual LLMs to generate the large-scale BhashaKritika corpus.
  • Figure 2: Distribution of languages (left) and topics (right) in BhashaKritika. We show the 12 broad topics for brevity, with a more fine-grained distribution in Table \ref{tab:broad_topic_distribution} (Appendix).
  • Figure 3: We annealed the pretrained LLaMA-3.2 1B model on $50$B tokens of Web data vs. our synthetic data, BhashaKritika. We observe faster convergence on BhashaKritika.
  • Figure 4: Loss curves for a simulated low-resource setting: LLaMA-3.2 1B is pretrained from scratch on $15$B Indic Web tokens ($10$K training steps), followed by continual training on (1) the same Web data; (2) BhashaKritika data.
  • Figure 5: Manually curated bias words (target and attribute sets) for caste aspect.
  • ...and 19 more figures