
JASMINE: Arabic GPT Models for Few-Shot Learning

El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, AbdelRahim Elmadany, Alcides Alcoba Inciarte, Md Tawkat Islam Khondaker

TL;DR

JASMINE introduces four Arabic autoregressive Transformer models (350M–6.7B parameters) trained on ~235 GB of diverse Arabic text, including Classical Arabic (CA), Modern Standard Arabic (MSA), and dialectal data. It provides a comprehensive Arabic evaluation benchmark spanning perplexity, autocompletion, commonsense reasoning, word manipulation, and natural language understanding (NLU), and reports strong few-shot learning and human-like fluency across dialects. The work also analyzes social biases in the models, finding gender, color, region, and religious biases, and emphasizes responsible release and risk mitigation. Together, these contributions give Arabic NLP large-scale, dialect-aware, few-shot-capable GPT-style models, along with an open evaluation framework and explicit ethical considerations.

Abstract

Scholarship on generative pretraining (GPT) remains acutely Anglocentric, leaving serious gaps in our understanding of the whole class of autoregressive models. For example, we have little knowledge about the potential of these models and their societal impacts in diverse linguistic and cultural settings. We alleviate this issue for Arabic, a wide collection of languages and dialectal varieties with a population of more than 400 million, by introducing JASMINE. JASMINE is a suite of powerful Arabic autoregressive Transformer language models, ranging in size from 300 million to 6.7 billion parameters, pretrained on a large and diverse dataset (~235 GB of text). We also carefully design and release a comprehensive benchmark for both automated and human evaluation of Arabic autoregressive models, with coverage of potential social biases, harms, and toxicity. Using our novel benchmark, we evaluate JASMINE extensively, showing powerful performance intrinsically as well as in few-shot learning on a wide range of NLP tasks. We aim to responsibly release our models and evaluation benchmark to interested researchers, along with code for experimenting with them.

Paper Structure (35 sections, 1 equation, 2 figures, 19 tables)

Figures (2)

  • Figure 1: Overview of AraSWAG dataset creation. On each iteration, a new MARBERT is trained on a dummy training set $\mathcal{D}_{train}$ to identify easily-classified generated endings on the dummy test set $\mathcal{D}_{test}$. The finetuned AraT5 is used to replace easily-classified generated endings with adversarial ones. This process is repeated iteratively to obtain a challenging dataset.
  • Figure 2: Percentages of correlates of bias towards religions/ideologies and religious/ideological groups.
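
The iterative loop described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: `adversarial_filter`, `train_discriminator`, `generate_ending`, the 50/50 dummy split, the `threshold`, and the number of `rounds` are all hypothetical names and assumptions. In the paper's pipeline, the discriminator role is played by MARBERT and the generator role by a finetuned AraT5; here both are passed in as plain callables so the sketch stays self-contained.

```python
import random
from typing import Callable, List, Sequence

# A scorer maps an ending to the estimated probability that it is machine-generated.
Scorer = Callable[[str], float]


def adversarial_filter(
    endings: List[str],
    train_discriminator: Callable[[Sequence[str]], Scorer],  # stand-in for training MARBERT
    generate_ending: Callable[[], str],                      # stand-in for the finetuned AraT5
    rounds: int = 5,          # assumed number of iterations
    threshold: float = 0.5,   # assumed cutoff for "easily classified"
    seed: int = 0,
) -> List[str]:
    """Iteratively replace generated endings that a fresh discriminator spots easily."""
    rng = random.Random(seed)
    endings = list(endings)  # work on a copy so the caller's list is untouched
    for _ in range(rounds):
        # Draw a new dummy train/test split on every iteration.
        indices = list(range(len(endings)))
        rng.shuffle(indices)
        half = len(indices) // 2
        train_idx, test_idx = indices[:half], indices[half:]

        # Train a new discriminator on the dummy training set.
        scorer = train_discriminator([endings[i] for i in train_idx])

        # Replace endings on the dummy test set that are too easy to detect.
        for i in test_idx:
            if scorer(endings[i]) > threshold:
                endings[i] = generate_ending()
    return endings
```

With toy stand-ins (for example, a scorer that flags very short endings and a generator that always returns a longer one), each round can only reduce the number of easily flagged endings, which is the property the iterative process relies on to make the dataset harder.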