ALLaM: Large Language Models for Arabic and English

M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani, Nouf M. Alotaibi, Hisham A. Alyahya, Sultan AlRashed, Faisal A. Mirza, Shaykhah Z. Alsubaie, Hassan A. Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman Alsubaihi, Maryam Al Mansour, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, Haidar Khan

TL;DR

ALLaM presents Arabic-English large language models built with tokenizer augmentation and expanded vocabulary learning to enable rapid Arabic acquisition without eroding English capabilities. The approach combines continued pretraining and training from scratch on carefully balanced Arabic/English data, followed by alignment via supervised fine-tuning and Direct Preference Optimization. Automatic, LLM-based, and human evaluations show state-of-the-art Arabic performance across benchmarks alongside improved English proficiency, underscoring the value of high-quality alignment data and translated content for cross-lingual models. The work also discusses ethical, safety, and environmental considerations, and their practical implications for Arabic language ecosystems and broader multilingual NLP.

Abstract

We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve in both Arabic and English over their base aligned models.

Paper Structure (40 sections, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Performance on Arabic and English MMLU benchmarks. ALLaM (red line) shows impressive improvement from its base model Llama-2 (yellow line). All evaluations were done on the latest version of the fine-tuned (chat or instruct) models. The ALLaM 7B from scratch model also shows significant improvement over the ALLaM 7B continued pretraining model.
  • Figure 2: Comparison of tokenizer fertility rates. The chart illustrates the fertility rates across four tokenizers: Llama-2, ALLaM Arabic only, ALLaM merged with Llama-2, and ALLaM Arabic/English (from scratch model). We calculate the fertility over a random subsample of the entire English, Arabic, and code training corpus.
  • Figure 3: Measuring the effect of adding machine translated Arabic data to pretraining. Although the two loss curves look normal (left), adding the translated Arabic reduced the frequency of gradient spikes during training (center). Adding translated Arabic data also clearly helps align the Arabic and English capabilities of the model and reduce catastrophic forgetting (right).
  • Figure 4: We determine the optimal Arabic/English language mixture, balancing the acquisition of Arabic understanding against the retention of English proficiency, by conducting ablations over 6 Arabic/English ratios (trained up to 20B tokens). We found that a 45/55 Arabic/English ratio achieves the best performance, as measured by English and translated Arabic MMLU.
  • Figure 5: Effect of random initialization vs. embedding initialization during the start of continued pretraining. We find that initializing the embeddings for new tokens from combinations of existing embeddings speeds up learning dramatically.
  • ...and 6 more figures
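
The fertility comparison in Figure 2 can be illustrated with a small sketch. It assumes the common definition of fertility as the average number of tokens produced per whitespace-delimited word; the excerpt here does not state the exact definition the authors used. The `chunk_tokenizer` function is a hypothetical stand-in for a real subword tokenizer, included only so the example runs on its own.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("sample contains no words")
    return total_tokens / total_words


def chunk_tokenizer(text: str, size: int = 3) -> List[str]:
    """Hypothetical tokenizer: cuts every word into chunks of at most `size` characters."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


if __name__ == "__main__":
    sample = ["tokenizer fertility", "مرحبا بالعالم"]
    print(f"fertility: {fertility(sample, chunk_tokenizer):.2f}")
```

A lower value means fewer tokens are spent per word, which is the property a vocabulary extended with Arabic tokens is meant to improve for Arabic text.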
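
The embedding initialization described in Figure 5 can be sketched as follows. The caption says only that new-token embeddings are initialized from "combinations of existing embeddings"; this sketch assumes one simple combination, the mean of the embeddings of the original-vocabulary pieces that a new token decomposes into. The function names and the toy vectors are hypothetical, not taken from the paper.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def init_new_embeddings(
    old_embeddings: Dict[str, Vector],
    new_token_pieces: Dict[str, List[str]],
) -> Dict[str, Vector]:
    """Return an expanded embedding table.

    `new_token_pieces` maps each new token to the pieces the original
    tokenizer would split it into. Each new token is assigned the mean of
    its pieces' embeddings (an assumed choice of combination).
    """
    expanded = dict(old_embeddings)
    for token, pieces in new_token_pieces.items():
        known = [old_embeddings[p] for p in pieces if p in old_embeddings]
        if not known:
            raise KeyError(f"no known pieces for new token {token!r}")
        expanded[token] = mean_vector(known)
    return expanded


if __name__ == "__main__":
    old = {"ab": [1.0, 3.0], "c": [3.0, 5.0]}
    table = init_new_embeddings(old, {"abc": ["ab", "c"]})
    print(table["abc"])
```

Starting a new token near the representation the model already assigns to its pieces gives training a meaningful starting point, in contrast to the random initialization the figure compares against.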