Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus

Raviraj Joshi, Kanishk Singla, Anusha Kamath, Raunak Kalani, Rakesh Paul, Utkarsh Vaidya, Sanjay Singh Chauhan, Niranjan Wartikar, Eileen Long

TL;DR

This work tackles the limited Hindi performance in multilingual LLMs by performing continued pre-training on a balanced Hindi–English corpus and by generating a synthetic Hindi dataset through translation and transliteration. The authors introduce Nemotron-Mini-Hindi-4B, a bilingual SLM, and its instruct variant, trained on 400B tokens and aligned via SFT and Direct Preference Optimization. Empirical results show state-of-the-art Hindi performance across Indic benchmarks while maintaining competitive English capabilities, with ablations highlighting the critical role of Hindi pre-training for factual accuracy and cross-lingual transfer. The study demonstrates that targeted pre-training and synthetic-data augmentation can substantially improve low-resource language capabilities in multilingual LLMs, with practical implications for Hinglish support and Indic language understanding.

Abstract

Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual SLM supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continued pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model's overall factual accuracy. We perform an ablation study to highlight the impact of Hindi pre-training, showing significant improvements in Hindi chat capabilities and factual accuracy, which cannot be achieved through Hindi alignment alone.

Paper Structure

This paper contains 9 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Adaptation of the multilingual Nemotron-Mini-4B model (also known as Minitron-4B).
  • Figure 2: Comparison of different instruct models on various parameters using SubjectiveEval.
  • Figure 3: Comparison of different instruct models on various parameters using IndicQuest-Hi.
  • Figure 4: Comparison of different instruct models on the Factuality score of IndicQuest. The ground-truth answers from IndicQuest are provided as a reference to GPT-4 for better scoring. Nemotron-Mini-Hindi-4B achieves comparable scores for Hindi and English, whereas other models achieve better factuality for English.
  • Figure 5: Results of human evaluation on translated MT-Bench. A win indicates that the Nemotron-Mini-Hindi-4B model is preferred.