Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models

Neha Sengupta; Sunil Kumar Sahu; Bokang Jia; Satheesh Katipomu; Haonan Li; Fajri Koto; William Marshall; Gurpreet Gosal; Cynthia Liu; Zhiming Chen; Osama Mohammed Afzal; Samta Kamboj; Onkar Pandit; Rahul Pal; Lalit Pradhan; Zain Muhammad Mujahid; Massa Baali; Xudong Han; Sondos Mahmoud Bsharat; Alham Fikri Aji; Zhiqiang Shen; Zhengzhong Liu; Natalia Vassilieva; Joel Hestness; Andy Hock; Andrew Feldman; Jonathan Lee; Andrew Jackson; Hector Xuguang Ren; Preslav Nakov; Timothy Baldwin; Eric Xing

TL;DR

Jais and Jais-chat introduce a 13B Arabic-centric decoder-only LLM and its instruction-tuned variant, trained on a large bilingual corpus (Arabic and English) with substantial programming code. The authors implement a custom Arabic preprocessing pipeline, a dedicated Jais tokenizer, ALiBi positional encodings, SwiGLU activations, and maximal update parametrization to maximize cross-lingual transfer and data efficiency. They conduct extensive downstream and generation evaluations in Arabic and English, showing state-of-the-art Arabic performance among open models and competitive English performance despite less English data. Safety is addressed via instruction-tuning, prompting, external detectors, and keyword-based filtering, and the models are released under Apache 2.0 to promote open research and Arabic NLP growth.
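
Two of the architecture choices named above, ALiBi positional encodings and SwiGLU activations, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`alibi_slopes`, `alibi_bias`, `swiglu`) are hypothetical, the slope schedule follows the standard ALiBi geometric sequence and assumes the head count is a power of two, and the SwiGLU function omits the down-projection that a full feed-forward block would apply afterwards.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes as a geometric sequence (assumes num_heads is a power of two)."""
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Additive attention bias of shape [head][query][key].

    Each past key is penalised linearly by its distance from the query;
    future keys are masked with -inf (causal, decoder-only attention).
    """
    return [
        [
            [-m * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in alibi_slopes(num_heads)
    ]


def swish(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def swiglu(x, w_gate, w_up):
    """SwiGLU gating for one vector: Swish(x @ w_gate) * (x @ w_up), element-wise."""
    gate = matvec(x, w_gate)
    up = matvec(x, w_up)
    return [swish(g) * u for g, u in zip(gate, up)]
```

Because the ALiBi bias depends only on the relative distance between query and key, no learned position embeddings are needed, which is what allows extrapolation to sequences longer than those seen in training.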

Abstract

We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than any existing open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models. We release two open versions of the model -- the foundation Jais model, and an instruction-tuned Jais-chat variant -- with the aim of promoting research on Arabic LLMs. Available at https://huggingface.co/inception-mbzuai/jais-13b-chat

Paper Structure (47 sections, 8 figures, 25 tables)

Introduction
Pretraining Data
Preprocessing Pipeline
Mixing Arabic and English Data
Model
Model Architecture
Jais Tokenizer
ALiBi Positional Encodings
SwiGLU Activation Function
Maximal Update Parametrization
Model and Training Hyperparameters
Learnings and Observations
Training Infrastructure
Instruction-Tuning
Instruction-Tuning Data
...and 32 more sections

Figures (8)

Figure 1: English--Arabic multiturn dialogue using Jais-chat.
Figure 2: Our Arabic preprocessing pipeline.
Figure 3: Cross-entropy loss on different model sizes with different configurations.
Figure 4: Our templates for instruction-tuning: the prompt is in blue, and the response is in green.
Figure 5: GPT-4 evaluation results for Jais-chat compared to open- and closed-source models on Arabic open-ended questions. The minimum and the maximum possible scores are 0 and 4,000, respectively.
...and 3 more figures
