Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning

Amir Mohammad Akhlaghi, Amirhossein Shabani, Mostafa Abdolmaleki, Saeed Reza Kheradpisheh

TL;DR

Persian-Phi demonstrates that high-quality cross-lingual capabilities for a low-resource language can be achieved with a compact 3.8B-parameter model by starting from an English monolingual base and applying a curriculum-driven adaptation pipeline. The method combines tokenizer augmentation, an embedding warm-up with bilingual Tiny Stories, continual pretraining on carefully filtered Persian corpora, and LoRA-based supervised fine-tuning to balance Persian fluency with English retention. Results on the Open Persian LLM Leaderboard show competitive performance against larger multilingual baselines, suggesting a scalable, resource-efficient pathway for expanding LLM support to underrepresented languages. The work provides a practical blueprint for similar adaptations to other languages, emphasizing data quality, efficient parameter updates, and careful alignment of cross-lingual representations.

Abstract

The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B-parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft Phi-3 Mini, originally a monolingual English model, can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. Our approach employs a "warm-up" stage using bilingual narratives (Tiny Stories) to align embeddings prior to heavy training, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on the Open Persian LLM Leaderboard on Hugging Face. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The Persian-Phi model is publicly available at https://huggingface.co/amirakhlaghiqqq/PersianPhi.
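To give a rough sense of why the PEFT stages are cheap, the sketch below shows the standard LoRA reparameterization, in which a frozen weight matrix W is used as W + (alpha / r) * B * A and only the small matrices B and A are trained. This is a minimal illustration, not the authors' training code: the layer size, rank r, and scaling factor alpha are hypothetical values chosen for the example and are not reported in this summary.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w: frozen base weight, shape (d_out, d_in)
    b: trainable matrix,   shape (d_out, r)
    a: trainable matrix,   shape (r, d_in)
    """
    r = len(a)  # the rank is the number of rows of A
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_param_count(d_out, d_in, r):
    """Number of trainable parameters LoRA adds for one weight matrix."""
    return r * (d_in + d_out)


# Tiny worked example with rank 1 (all numbers are illustrative).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, -0.5]]
effective = lora_effective_weight(w, b, a, alpha=2.0)

# Hypothetical layer size: a 1024 x 1024 projection adapted with rank 8.
full_params = 1024 * 1024
lora_params = lora_param_count(1024, 1024, r=8)
trainable_fraction = lora_params / full_params
```

In this hypothetical setting the adapter trains 16,384 parameters instead of 1,048,576 for the layer, which is about 1.6% of the full matrix. Layers that must learn entirely new rows, such as the embeddings and output head for added tokens, are the ones the pipeline tunes in full.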

Paper Structure

This paper contains 34 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The cross-lingual adaptation pipeline began with a warm-up phase (left), which introduced an extended tokenizer with new Persian tokens. Using translated Tiny Stories, this stage employed low-rank LoRA fine-tuning alongside full fine-tuning of the new embedding and head parameters to smoothly initialize and align these tokens. The main adaptation phase (right) involved continual pre-training on a large, filtered and deduplicated Persian corpus, utilizing higher-rank LoRA on attention/feed-forward layers plus full embedding/head tuning to build deep language understanding. Finally, supervised fine-tuning (SFT) was performed on a mixed instruction dataset (Persian, English), again using LoRA, to refine instruction-following and conversational abilities in both languages, ensuring Persian proficiency was added while retaining original English capabilities.
  • Figure 2: Heatmap of cosine similarity between equivalent English and Persian tokens. Tokens with similar meanings have higher similarity scores, which points to the effectiveness of the language warm-up process for Persian.
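
The kind of matrix that a heatmap like Figure 2 visualizes can be sketched as follows. This is a minimal illustration under stated assumptions: the three-dimensional vectors and the word pairs are made-up stand-ins for real embedding rows, and the helper names are hypothetical rather than taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(rows, cols):
    """Matrix whose entry [i][j] is the cosine similarity of rows[i] and cols[j]."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Made-up toy embeddings; real ones would come from the model's embedding table.
english = {
    "cat": [0.9, 0.1, 0.0],
    "water": [0.0, 0.8, 0.6],
    "book": [0.1, 0.0, 0.95],
}
persian = {
    "گربه": [0.85, 0.2, 0.05],  # "cat"
    "آب": [0.05, 0.75, 0.65],   # "water"
    "کتاب": [0.15, 0.05, 0.9],  # "book"
}

matrix = similarity_matrix(list(english.values()), list(persian.values()))

# For each English token, the index of the most similar Persian token.
best = [max(range(len(row)), key=row.__getitem__) for row in matrix]
```

If the embeddings are well aligned, the largest value in each row falls on the diagonal, so each English token is closest to its Persian equivalent. That diagonal pattern is what the caption describes.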