UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings

Layba Fiaz; Munief Hassan Tahir; Sana Shams; Sarmad Hussain

TL;DR

UrduLLaMA 1.0 addresses the performance gap that LLMs show on Urdu, a data-scarce language, with a four-stage pipeline: continual pretraining on 128M Urdu tokens, instruction tuning with 41k Urdu instructions, MT-focused fine-tuning on 62,970 in-house translations, and comprehensive preprocessing to ensure Urdu-centric data quality. The model, built on Llama-3.1-8B-Instruct and adapted with LoRA for efficient instruction alignment, shows BLEU gains over the base model across multiple MT benchmarks, and expert human judgments corroborate these gains. The work demonstrates that targeted adaptation under limited data and compute can yield substantial improvements for a low-resource language, and it establishes UrduLLaMA 1.0 as a new benchmark for Urdu LLMs. The study also discusses limitations, including partial data coverage and the lack of detoxification, and points to broader data collection and safety controls as future directions.

Abstract

Multilingual Large Language Models (LLMs) often perform suboptimally on low-resource languages like Urdu. This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture and continually pre-trained on 128 million Urdu tokens, capturing the rich diversity of the language. To enhance instruction-following and translation capabilities, we leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs. Evaluation across three machine translation datasets demonstrates significant performance improvements compared to state-of-the-art (SOTA) models, establishing a new benchmark for Urdu LLMs. These findings underscore the potential of targeted adaptation strategies with limited data and computational resources to address the unique challenges of low-resource languages.
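To make the "limited computational resources" point concrete, the following is a minimal sketch of the general LoRA idea: a frozen weight matrix W is augmented with a low-rank update B·A scaled by alpha / rank, and only A and B are trained. It is an illustration under stated assumptions, not the authors' training code. The rank of 16, the alpha value, and the 4096 x 4096 layer shape are hypothetical choices for illustration; this section does not state the LoRA hyperparameters or which layers were adapted.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full fine-tuning vs. LoRA.

    Full fine-tuning updates all d_out * d_in entries; LoRA trains only
    B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x).

    W stays frozen; A and B are the only trained matrices.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


if __name__ == "__main__":
    # Hypothetical shape and rank, chosen only to illustrate the savings.
    full, lora = lora_param_counts(4096, 4096, 16)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")

    # Tiny numeric example: rank-1 update on a 2x2 identity weight.
    W = [[1, 0], [0, 1]]
    A = [[1, 1]]          # rank x d_in
    B = [[1], [2]]        # d_out x rank
    print(lora_forward([3, 4], W, A, B, alpha=2, rank=1))
```

With B initialized to zeros, as is standard for LoRA, the adapted layer reproduces the base model's output exactly at the start of training, so fine-tuning begins from the pretrained behavior.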
