Creating Arabic LLM Prompts at Scale
Abdelrahman El-Sheikh, Ahmed Elmogtaba, Kareem Darwish, Muhammad Elmallah, Ashraf Elneima, Hassan Sawaf
TL;DR
The paper tackles scaling Arabic instruction-following prompts by proposing two complementary methods: translating English prompt datasets with automated quality filtering, and creating prompts directly from existing Arabic NLP datasets. It reports a very large Arabic prompt corpus (more than 67.4 million prompts) and shows that fine-tuning a base Qwen2 7B model on this data with LoRA significantly improves instruction-following performance on Arabic tasks, often surpassing larger instruction-tuned models such as Llama3 70B. The approach highlights the value of data scale and quality control (including COMET-QE filtering and manual verification) in building effective Arabic LLM capabilities, and suggests that these methods can generalize to other languages. Overall, the work provides a practical, scalable pipeline for high-quality prompt generation and shows tangible gains in Arabic instruction following through targeted fine-tuning.
Abstract
The debut of ChatGPT and Bard has popularized instruction-following text generation using LLMs, where a user can interrogate an LLM using natural language requests and obtain natural language answers that match their requests. Training LLMs to respond in this manner requires a large number of worked-out examples of user requests (aka prompts) with corresponding gold responses. In this paper, we introduce two methods for creating such prompts for Arabic cheaply and quickly. The first method entails automatically translating existing prompt datasets from English, such as PromptSource and Super-NaturalInstructions, and then using machine translation quality estimation to retain only high-quality translations. The second method involves creating natural language prompts on top of existing Arabic NLP datasets. Using these two methods, we were able to create more than 67.4 million Arabic prompts that cover a variety of tasks, including summarization, headline generation, grammar checking, open/closed question answering, creative writing, etc. We show that fine-tuning an open 7 billion parameter large language model, namely base Qwen2 7B, enables it to outperform a state-of-the-art 70 billion parameter instruction-tuned model, namely Llama3 70B, in handling Arabic prompts.
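The two methods described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each translated prompt already carries a quality-estimation score from some QE model, and that Arabic dataset records are plain dictionaries. The names `TranslatedPrompt`, `filter_by_quality`, `build_prompts`, and `HEADLINE_TEMPLATES`, the field names `article` and `headline`, the threshold value, and the two Arabic templates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TranslatedPrompt:
    source_en: str   # original English prompt
    target_ar: str   # machine-translated Arabic prompt
    qe_score: float  # assumed to be precomputed by a QE model (hypothetical field)


def filter_by_quality(pairs: Iterable[TranslatedPrompt],
                      threshold: float) -> List[TranslatedPrompt]:
    """Method 1 (sketch): keep only translations whose QE score meets the threshold."""
    return [pair for pair in pairs if pair.qe_score >= threshold]


# Hypothetical Arabic templates for a headline-generation task.
HEADLINE_TEMPLATES = [
    "اكتب عنوانًا مناسبًا للمقال التالي:\n{article}",
    "اقترح عنوانًا قصيرًا لهذا النص:\n{article}",
]


def build_prompts(records: Iterable[Dict[str, str]],
                  templates: List[str]) -> List[Dict[str, str]]:
    """Method 2 (sketch): wrap each dataset record in every template.

    Each record is assumed to hold an 'article' (input) and a 'headline'
    (gold response); these field names are illustrative only.
    """
    prompts = []
    for record in records:
        for template in templates:
            prompts.append({
                "prompt": template.format(article=record["article"]),
                "response": record["headline"],
            })
    return prompts


if __name__ == "__main__":
    pairs = [
        TranslatedPrompt("Summarize the text.", "لخص النص.", 0.85),
        TranslatedPrompt("Fix the grammar.", "ترجمة ضعيفة", 0.10),
    ]
    # The threshold 0.5 is an arbitrary value for illustration only.
    print(len(filter_by_quality(pairs, threshold=0.5)))

    records = [{"article": "نص المقال هنا", "headline": "عنوان المقال"}]
    print(len(build_prompts(records, HEADLINE_TEMPLATES)))
```

In this sketch every record is paired with every template, so the number of generated prompts is the number of records times the number of templates. That is one plausible way a template-based approach can scale, though the source does not state how its templates were applied.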
