Gazelle: An Instruction Dataset for Arabic Writing Assistance

Samar M. Magdy, Fakhraddin Alwajih, Sang Yun Kwon, Reem Abdel-Salam, Muhammad Abdul-Mageed

TL;DR

This paper presents Gazelle, a comprehensive dataset for Arabic writing assistance, together with an evaluation framework designed to enhance Arabic writing assistance tools. Its findings underscore the need for continuous model training and dataset enrichment to manage the complexities of Arabic language processing.

Abstract

Writing has long been considered a hallmark of human intelligence and remains a pinnacle task for artificial intelligence (AI) due to the intricate cognitive processes involved. Recently, rapid advancements in generative AI, particularly through the development of Large Language Models (LLMs), have significantly transformed the landscape of writing assistance. However, underrepresented languages like Arabic encounter significant challenges in the development of advanced AI writing tools, largely due to the limited availability of data. This scarcity constrains the training of effective models, impeding the creation of sophisticated writing assistance technologies. To address these issues, we present Gazelle, a comprehensive dataset for Arabic writing assistance. In addition, we offer an evaluation framework designed to enhance Arabic writing assistance tools. Our human evaluation of leading LLMs, including GPT-4, GPT-4o, Cohere Command R+, and Gemini 1.5 Pro, highlights their respective strengths and limitations in addressing the challenges of Arabic writing. Our findings underscore the need for continuous model training and dataset enrichment to manage the complexities of Arabic language processing, paving the way for more effective AI-powered Arabic writing tools.

Paper Structure

This paper contains 50 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Task categorization for Gazelle. GEC: Grammatical Error Correction (التصحيح النحوي); MWEs: Multi-Word Expressions.
  • Figure 2: Examples of corrections and explanations for various Arabic writing tasks included in Gazelle. For instructions in English, see Table \ref{fig:subtasks}.
  • Figure 3: Overall inter-annotator agreement for human evaluation measured by Cohen's Kappa.
  • Figure 4: Results of human evaluation for four LLMs (GPT-4, GPT-4o, Cohere Command R+, and Gemini 1.5 Pro) on five subtasks in Gazelle.
  • Figure D.1: Examples of writing assistance instructions in Arabic and English.
  • ...and 3 more figures