Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

Jianqing Zhu, Huang Huang, Zhihang Lin, Juhao Liang, Zhengyang Tang, Khalid Almubarak, Abdulmohsen Alharthik, Bang An, Juncai He, Xiangbo Wu, Fei Yu, Junying Chen, Zhuoheng Ma, Yuhao Du, He Zhang, Emad A. Alghamdi, Lian Zhang, Ruoyu Sun, Haizhou Li, Benyou Wang, Jinchao Xu

TL;DR

This work tackles the lag in Arabic LLM development by introducing a progressive vocabulary expansion strategy, inspired by Second Language Acquisition (SLA), that grows the set of Arabic subwords during training. It presents Incremental Byte Pair Encoding (I-BPE) to expand the vocabulary across 16 stages, balancing OOV ratios and preserving prior knowledge, and demonstrates its effectiveness through extensive benchmarks. The AraLLaMA models (7B and 13B) achieve strong Arabic performance, rivaling or surpassing larger multilingual peers, with substantially faster decoding than prior Arabic models. The approach is complemented by Arabic instruction tuning (ALAN) and full open-sourcing of data, weights, and pipelines, which together advance open-access, high-quality Arabic NLP and cross-lingual transfer.

Abstract

This paper addresses the critical need for democratizing large language models (LLMs) in the Arab world, a region that has seen slower progress in developing models comparable to state-of-the-art offerings like GPT-4 or ChatGPT 3.5, due to a predominant focus on mainstream languages (e.g., English and Chinese). One practical objective for an Arabic LLM is to use an Arabic-specific tokenizer vocabulary, which could speed up decoding. However, switching to a different vocabulary often degrades learned knowledge, since many words are out-of-vocabulary (OOV) when training starts. Inspired by vocabulary learning during human Second Language (Arabic) Acquisition, the released AraLLaMA employs progressive vocabulary expansion, implemented by a modified BPE algorithm that progressively extends the Arabic subwords in its dynamic vocabulary during training, thereby balancing the OOV ratio at every stage. The ablation study demonstrates the effectiveness of Progressive Vocabulary Expansion. Moreover, AraLLaMA achieves performance comparable to the best Arabic LLMs across a variety of Arabic benchmarks. Models, training data, benchmarks, and code will all be open-sourced.
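To make the idea of staged vocabulary growth concrete, the following is a minimal sketch, not the authors' I-BPE implementation. It assumes the added subwords are ordinary BPE merges learned greedily, and that each training stage unlocks a cumulative number of them according to either an exponential or a uniform schedule. The function names, the toy corpus, and the stage sizes are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def exponential_schedule(total_new, stages):
    """Cumulative number of added subwords per stage, roughly doubling each stage."""
    return [max(1, total_new >> (stages - 1 - s)) for s in range(stages)]


def uniform_schedule(total_new, stages):
    """Cumulative number of added subwords per stage, growing in equal steps."""
    return [total_new * (s + 1) // stages for s in range(stages)]


def count_pairs(words):
    """Frequency of each adjacent symbol pair over the segmented corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = Counter()
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def staged_bpe(corpus, schedule):
    """Return the list of added subwords available after each stage of `schedule`."""
    words = Counter(tuple(word) for word in corpus)  # start from single characters
    added, snapshots = [], []
    for target in schedule:
        # Continue merging from the previous stage's state until the target is met.
        while len(added) < target:
            pairs = count_pairs(words)
            if not pairs:
                break  # nothing left to merge
            best = max(pairs, key=pairs.get)
            words = merge_pair(words, best)
            added.append(best[0] + best[1])
        snapshots.append(list(added))
    return snapshots


# Hypothetical toy corpus (Latin letters for readability) and a 3-stage schedule.
toy_corpus = ["low", "low", "lower", "lowest", "newer", "newest", "wider"]
for stage, vocab in enumerate(staged_bpe(toy_corpus, exponential_schedule(4, 3)), 1):
    print(stage, vocab)
```

Because each stage resumes from the merges already learned, every stage's vocabulary is a superset of the previous one. In this sketch the exponential schedule introduces very few new subwords early and most of them late, whereas the uniform schedule adds the same number at every stage.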

Paper Structure

This paper contains 48 sections, 3 figures, 14 tables, 1 algorithm.

Figures (3)

  • Figure 1: Second Language Acquisition for humans: an English-speaking child's journey to Arabic fluency, from basic vocabulary to cultural proficiency
  • Figure 2: Compression ratio comparison between uniform and exponential vocabulary expansion strategies.
  • Figure 3: Loss curve of TinyLLaMa with sliding window average

Theorems & Definitions (1)

  • Definition 1