CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks
Zhaozhi Qian, Faroq Altam, Muhammad Alqurishi, Riad Souissi
TL;DR
Arabic language AI has suffered from data scarcity and cultural misalignment. This work presents Juhaina, a $9.24$B-parameter decoder-only LLM with an $8192$-token context, post-trained on Gemma 2 to improve Arabic proficiency, factual regional knowledge, and cultural alignment. It also introduces CamelEval, a benchmark that assesses instruction following and cultural nuance beyond what traditional benchmarks capture. The authors employ a two-stage post-training pipeline (SFT using Llama Pro and ORPO-based alignment with human feedback) together with a rigorous data-collection workflow that prioritizes data quality. They demonstrate that Juhaina outperforms comparably sized models on Arabic tasks and performs competitively on challenging, culturally nuanced queries, while highlighting limitations of the OALL benchmark. The weights and CamelEval resources are released under the MIT license, aiming to democratize access to advanced Arabic AI for more than $400$ million Arabic speakers and to provide a more realistic evaluation framework for future Arabic LLMs.
Abstract
Large Language Models (LLMs) are the cornerstones of modern artificial intelligence systems. This paper introduces Juhaina, an Arabic-English bilingual LLM specifically designed to align with the values and preferences of Arabic speakers. Juhaina inherently supports advanced functionalities such as instruction following, open-ended question answering, information provisioning, and text processing. Our model contains 9.24 billion parameters and is trained with a context window of up to 8,192 tokens. This paper details the creation process of Juhaina and provides an extensive empirical evaluation. Furthermore, we identify the limitations of the widely adopted Open Arabic LLM Leaderboard (OALL) and propose a new evaluation benchmark, CamelEval. Our findings demonstrate that Juhaina surpasses existing LLMs of comparable size, such as those in the Llama and Gemma families, in generating helpful responses in Arabic, providing factually accurate information about the region, and understanding nuanced cultural aspects. We hope that Juhaina will democratize cutting-edge AI technologies, serving over 400 million Arabic speakers by offering LLMs that not only communicate in their language but also comprehend their culture. We publicly release all models on Hugging Face at \url{https://huggingface.co/elmrc}.
