Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali
Sharad Duwal, Suraj Prasai, Suresh Manandhar
TL;DR
This paper tackles adapting a pre-trained LLM to Nepali under resource constraints, using domain-adaptive continual pretraining with synthetic parallel data in a 4-bit QLoRA setup on Llama 3 8B. It introduces a two-stage training pipeline (translation-based pretraining, then bilingual next-token prediction), followed by QLoRA finetuning on mixed Nepali/English instructions. Results show that the adapted model gains Nepali generation capabilities but forgets some of its English abilities; few-shot prompting reveals latent retention, and attention-head analyses indicate cross-lingual alignment. The work demonstrates that domain-adaptive pretraining can make LLMs more accessible for low-resource languages when data and compute are limited, offering a practical path toward broader multilingual applicability.
Abstract
Continual learning has emerged as an important research direction because retraining large language models (LLMs) from scratch whenever new data becomes available is infeasible. Of great interest is the domain-adaptive pre-training (DAPT) paradigm, which continues training a pre-trained language model to adapt it to a domain it was not originally trained on. In this work, we evaluate the feasibility of DAPT in a low-resource setting, namely the Nepali language. We use synthetic data to continue training Llama 3 8B, adapting it to Nepali in a 4-bit QLoRA setting. We evaluate the adapted model on its performance, forgetting, and knowledge acquisition. We compare the base model and the final model on their Nepali generation abilities and their performance on popular benchmarks, and run case studies to probe their linguistic knowledge of Nepali. We see some unsurprising forgetting in the final model, but also surprisingly find that increasing the number of shots during evaluation yields larger percent increases for the final model (as high as 19.29%) than for the base model (4.98%), suggesting latent retention. We also explore layer-head self-attention heatmaps to establish the dependency resolution abilities of the final model in Nepali.
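To make the few-shot comparison concrete, the sketch below shows one way such percent increases could be computed. It assumes the reported figures are relative gains over the lower-shot score, which the abstract does not spell out. The function name `percent_increase` and all scores are hypothetical illustrations, not results from the paper.

```python
def percent_increase(low_shot_score: float, high_shot_score: float) -> float:
    """Relative gain (in percent) when moving from fewer to more shots."""
    if low_shot_score <= 0:
        raise ValueError("low_shot_score must be positive")
    return (high_shot_score - low_shot_score) / low_shot_score * 100.0


# Hypothetical benchmark accuracies, NOT results from the paper.
# Each tuple is (score with fewer shots, score with more shots).
hypothetical_scores = {
    "base model": (60.0, 63.0),
    "final model": (40.0, 48.0),
}

# Compute the relative gain for each model from its own lower-shot score.
gains = {
    name: percent_increase(low, high)
    for name, (low, high) in hypothetical_scores.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}% increase")
```

Under this reading, a model that starts from a lower score can show a larger relative gain from additional shots even if its absolute score remains below that of the other model, so relative gains are best read alongside the absolute scores.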
