GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning

Hasna Chouikhi, Manel Aloui, Cyrine Ben Hammou, Ghaith Chaabane, Haithem Kchaou, Chehir Dhaouadi

TL;DR

This work tackles the lack of Arabic instruction-tuning resources by introducing InstAr-500k, a hybrid Arabic instruction dataset assembled from synthetic data generated via seed prompts and human-crafted content. It fine-tunes Gemma-7B-IT using LoRA within the LLaMAFactory framework to create GemmAr-7B-V1, and evaluates on a comprehensive suite of Arabic benchmarks via the Open Arabic LLM Evaluation Leaderboard (OALL), achieving strong multi-domain performance (average ~47.27% on OALL). The dataset construction combines monolingual knowledge distillation and diverse data sources, with a dedicated pipeline for data generation, cleaning, classification, and formatting, ensuring high-quality instruction-output pairs. The results indicate a meaningful narrowing of the English–Arabic performance gap in NLP tasks and demonstrate the practical impact of targeted Arabic instruction data for scalable Arabic NLP development, while acknowledging dialectal coverage and hardware constraints as areas for future work.
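The cleaning and formatting stages mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline, whose details are not given in this summary. The function names `normalize` and `build_records`, the sample Arabic strings, and the Alpaca-style field names (`instruction`, `input`, `output`) are all assumptions made for illustration.

```python
import json
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def build_records(raw_pairs):
    """Clean (instruction, context, output) triples; drop empties and duplicates."""
    seen = set()
    records = []
    for instruction, context, output in raw_pairs:
        instruction, context, output = map(normalize, (instruction, context, output))
        if not instruction or not output:
            continue  # an instruction-output pair needs both sides
        key = (instruction, context)
        if key in seen:
            continue  # keep only the first occurrence of a prompt
        seen.add(key)
        # Hypothetical Alpaca-style layout; the real schema may differ.
        records.append({"instruction": instruction, "input": context, "output": output})
    return records


# Hypothetical sample data, not taken from InstAr-500k.
raw = [
    ("ما هي عاصمة تونس؟", "", "عاصمة تونس هي مدينة تونس."),
    ("ما هي  عاصمة تونس؟ ", "", "تونس."),  # duplicate once whitespace is cleaned
    ("لخص النص التالي.", "نص قصير.", ""),  # dropped: empty output
]
records = build_records(raw)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Deduplicating on the normalized instruction and context, rather than on the raw strings, is one simple way to catch near-identical prompts that differ only in spacing; a real pipeline would likely need stronger near-duplicate detection.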

Abstract

Large language models (LLMs) have greatly impacted the natural language processing (NLP) field, particularly for the English language. These models have demonstrated capabilities in understanding and generating human-like text. The success of language models largely depends on the availability of high-quality instruction datasets, which consist of detailed task descriptions and corresponding responses that are essential for training the models to address a variety of prompts accurately. However, the availability and quality of these resources vary by language. While models perform well in English, they often struggle with languages like Arabic, because few datasets exist for fine-tuning on Arabic-specific tasks. To address this issue, we introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content that covers several domains and instruction types. We assess this dataset by fine-tuning an open-source Gemma-7B model on several downstream tasks to improve its functionality. Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks. These outcomes demonstrate the effectiveness of our dataset in improving the capabilities of language models for Arabic. Our instruction dataset helps bridge the performance gap between English and Arabic language models by providing resources that support Arabic NLP development. Building on this foundation, we developed a model, GemmAr-7B-V1, specifically tuned to excel at a wide range of Arabic NLP tasks.

Paper Structure (21 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of the InstAr-500k dataset construction pipeline.
  • Figure 5: Comparison of performance scores for GemmAr-7B-V1, AceGPT-7B-chat, and Gemma-7B-IT across different benchmarks.
  • Figure 6: Example of the prompt, context, and output for the Open QA task.
  • Figure 7: Example of the prompt, context, and output for the Extraction task.
  • Figure 8: Example of the prompt, context, and output for the Explanation task.