Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation

Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt, Hamza Farooq

TL;DR

The paper tackles the challenge of building high-performing LLMs for Urdu despite scarce data and the need for cultural nuance by introducing Alif-1.0-8B-Instruct, a multilingual Urdu–English model trained on a high-quality synthetic Urdu-Instruct dataset generated via a modified self-instruct pipeline. It combines continued pre-training on Urdu data, instruction-focused fine-tuning on task-diverse synthetic and translated data, and replay data to mitigate forgetting, all using efficient LoRA-based training on a sub-$100 budget. Empirical results show that Alif outperforms prominent multilingual LLMs on Urdu benchmarks and retains strong English capabilities, while quantization studies demonstrate deployment-friendly memory-accuracy trade-offs. The work provides a scalable, culturally aware approach to Urdu NLP and offers resources and a methodology that can be extended to other low-resource languages.

Abstract

Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought-based reasoning, bilingual translation, cultural relevance, and ethical safety alignment. This technique significantly enhances the comprehension of the Alif-1.0-8B-Instruct model on Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct on Urdu-specific tasks. It also outperforms leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performing, culturally aligned LLMs for low-resource languages can be developed efficiently using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.
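
The abstract describes the data pipeline only at a high level: per-task prompts and seed values, plus a global task pool. The following Python sketch is one possible reading of that description, not the authors' implementation. All names (`TaskConfig`, `GlobalTaskPool`, `stub_generate`, `build_dataset`) are hypothetical, the LLM call is replaced by a stub, and using the global pool for cross-task deduplication is an assumption.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TaskConfig:
    """Hypothetical per-task settings: its own prompt template and seed examples."""
    name: str
    prompt_template: str  # must contain an "{examples}" placeholder
    seed_instructions: list[str]


@dataclass
class GlobalTaskPool:
    """Assumed role of the global pool: reject instructions already seen in any task."""
    seen: set[str] = field(default_factory=set)

    def add_if_new(self, instruction: str) -> bool:
        # Normalise whitespace and case so trivial variants count as duplicates.
        key = " ".join(instruction.split()).lower()
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


def stub_generate(prompt: str, rng: random.Random) -> str:
    """Stand-in for an LLM call (assumption); returns a placeholder instruction."""
    return f"generated-{rng.randint(0, 9)}"


def build_dataset(tasks, per_task, generate=stub_generate, seed=0, max_attempts=1000):
    """Collect up to `per_task` new instructions per task, deduplicated globally."""
    rng = random.Random(seed)
    pool = GlobalTaskPool()
    dataset = []
    for task in tasks:
        accepted, attempts = 0, 0
        while accepted < per_task and attempts < max_attempts:
            attempts += 1
            # Sample in-context examples from this task's own seed set.
            k = min(2, len(task.seed_instructions))
            examples = rng.sample(task.seed_instructions, k=k)
            prompt = task.prompt_template.format(examples="\n".join(examples))
            candidate = generate(prompt, rng)
            if pool.add_if_new(candidate):
                dataset.append({"task": task.name, "instruction": candidate})
                accepted += 1
    return dataset


if __name__ == "__main__":
    demo_tasks = [
        TaskConfig("translation", "Examples:\n{examples}\nWrite a new task.", ["t1", "t2"]),
        TaskConfig("reasoning", "Examples:\n{examples}\nWrite a new task.", ["r1", "r2"]),
    ]
    for row in build_dataset(demo_tasks, per_task=3):
        print(row)
```

In this sketch each task keeps its own template and seed set, so prompts stay task-specific, while the shared pool prevents the same instruction from entering the dataset under two different tasks. A real pipeline would swap `stub_generate` for a model call and likely add quality filtering, which is omitted here.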

Paper Structure

This paper contains 25 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 2: Comparison of Alif-1.0-8B-Instruct and Meta-Llama-3.1-8B-Instruct on Urdu-translated benchmarks.
  • Figure 3: Perplexity comparison across GGUF quantization formats for Alif-1.0-8B-Instruct.
  • Figure 4: Memory footprint of different GGUF quantization formats for Alif-1.0-8B-Instruct.
  • Figure 5: Annotator Demographics by Province in Pakistan.
  • Figure 6: Overview of the Urdu-Instruct dataset refinement guidelines.
  • ...and 1 more figure