Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali

Sharad Duwal, Suraj Prasai, Suresh Manandhar

TL;DR

This paper adapts a pre-trained LLM to Nepali under resource constraints, using domain-adaptive continual pretraining on synthetic parallel data with Llama 3 8B in a 4-bit QLoRA setup. It introduces a two-stage training pipeline (translation-based pretraining, then bilingual next-token prediction), followed by QLoRA finetuning on mixed Nepali/English instructions. The adapted model gains Nepali generation capabilities but shows some forgetting of English; few-shot prompting reveals latent retention, and attention-head analyses indicate cross-lingual alignment. The work demonstrates that domain-adaptive pretraining can make LLMs more accessible to low-resource languages when data and compute are limited, offering a practical path toward broader multilingual applicability.

Abstract

Continual learning has emerged as an important research direction due to the infeasibility of retraining large language models (LLMs) from scratch in the event of new data availability. Of great interest is the domain-adaptive pre-training (DAPT) paradigm, which focuses on continually training a pre-trained language model to adapt it to a domain it was not originally trained on. In this work, we evaluate the feasibility of DAPT in a low-resource setting, namely the Nepali language. We use synthetic data to continue training Llama 3 8B to adapt it to the Nepali language in a 4-bit QLoRA setting. We evaluate the adapted model on its performance, forgetting, and knowledge acquisition. We compare the base model and the final model on their Nepali generation abilities, their performance on popular benchmarks, and run case-studies to probe their linguistic knowledge in Nepali. We see some unsurprising forgetting in the final model, but also surprisingly find that increasing the number of shots during evaluation yields better percent increases in the final model (as high as 19.29% increase) compared to the base model (4.98%), suggesting latent retention. We also explore layer-head self-attention heatmaps to establish dependency resolution abilities of the final model in Nepali.
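
The shot-scaling comparison above is reported as percent increases. As a minimal sketch, assuming these are relative gains of a higher-shot score over a lower-shot score (the abstract does not spell out the formula), the computation could look like this. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def percent_increase(low_shot_score: float, high_shot_score: float) -> float:
    """Relative gain, in percent, of the higher-shot score over the lower-shot one."""
    if low_shot_score == 0:
        raise ValueError("percent increase is undefined for a zero baseline")
    return (high_shot_score - low_shot_score) / low_shot_score * 100.0


# Hypothetical scores, for illustration only (not results from the paper).
example_gain = percent_increase(40.0, 50.0)
print(f"{example_gain:.2f}% increase")
```

Under this reading, a model that starts from a lower score can show a larger percent increase than one that starts higher, even when the absolute gain is similar.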

Paper Structure

This paper contains 17 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: GPT-4o scores for Nepali answers generated by the base model (Llama 3 8B 4-bit) and our model on five attributes: correctness, grammar, usability, hallucination and overall quality. Empty generations from the models are scored 0 on all attributes. Panels c) and d) show the distribution of scores across the attributes, with medians and outliers.
  • Figure 2: Layer-head heatmaps visualizing attention from adjectives to their respective nouns in English (a,c) and Nepali (b,d) for the base model (a,b) and our model (c,d). Rows are layers and columns are attention heads. From b) and d), we can see that our model has learned to attend to Nepali adjectives the way the base model attends to English ones in a).
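
A layer-head heatmap of the kind described for Figure 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes attention weights are available as a nested list indexed `attn[layer][head][query][key]`, and that one token pair (for example a noun and its adjective) has already been located. The function name, the toy weights, and the choice of which token is the query are all hypothetical.

```python
def layer_head_heatmap(attn, query_idx, key_idx):
    """Return grid[layer][head]: the attention weight from query token to key token.

    attn is assumed to be indexed as attn[layer][head][query][key].
    Rows of the result are layers, columns are attention heads.
    """
    return [[head[query_idx][key_idx] for head in layer] for layer in attn]


# Toy weights: 2 layers x 2 heads x 2 tokens (hypothetical values, not model output).
# Token 0 stands for an adjective, token 1 for the noun that follows it.
toy_attn = [
    [[[1.0, 0.0], [0.25, 0.75]], [[1.0, 0.0], [0.5, 0.5]]],
    [[[1.0, 0.0], [0.75, 0.25]], [[1.0, 0.0], [0.125, 0.875]]],
]

# Weight the noun (query, position 1) places on the adjective (key, position 0).
heatmap = layer_head_heatmap(toy_attn, query_idx=1, key_idx=0)
for layer_row in heatmap:
    print(layer_row)
```

In a causal language model only the later token can attend to the earlier one, so which member of the adjective-noun pair serves as the query depends on word order and on how the analysis is set up.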