CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks

Zhaozhi Qian; Faroq Altam; Muhammad Alqurishi; Riad Souissi

TL;DR

Arabic language AI has suffered from data scarcity and cultural misalignment. This work presents Juhaina, a 9.24B-parameter decoder-only LLM with an 8,192-token context, post-trained on Gemma 2 to improve Arabic proficiency, factual regional knowledge, and cultural alignment. It also introduces CamelEval, a benchmark intended to assess instruction following and cultural nuance better than traditional benchmarks do. The authors employ a two-stage post-training pipeline (SFT using Llama Pro, followed by ORPO-based alignment with human feedback) together with a rigorous data-collection workflow that prioritizes data quality. They demonstrate that Juhaina outperforms comparable-sized models on Arabic tasks and performs competitively on challenging, culturally nuanced queries, and they highlight limitations of the OALL benchmark. The weights and CamelEval resources are released under the MIT license, aiming to democratize access to advanced Arabic AI for more than 400 million Arabic speakers and to provide a more realistic evaluation framework for future Arabic LLMs.

Abstract

Large Language Models (LLMs) are the cornerstones of modern artificial intelligence systems. This paper introduces Juhaina, an Arabic-English bilingual LLM specifically designed to align with the values and preferences of Arabic speakers. Juhaina inherently supports advanced functionalities such as instruction following, open-ended question answering, information provisioning, and text processing. Our model contains 9.24 billion parameters and is trained with a context window of up to 8,192 tokens. This paper details the creation process of Juhaina and provides an extensive empirical evaluation. Furthermore, we identify the limitations of the widely adopted Open Arabic LLM Leaderboard (OALL) and propose a new evaluation benchmark, CamelEval. Our findings demonstrate that Juhaina surpasses existing LLMs of comparable size, such as the Llama and Gemma families, in generating helpful responses in Arabic, providing factually accurate information about the region, and understanding nuanced cultural aspects. We hope Juhaina will democratize cutting-edge AI technologies, serving over 400 million Arabic speakers by offering LLMs that not only communicate in their language but also comprehend their culture. We publicly release all models on Hugging Face: https://huggingface.co/elmrc

Paper Structure (22 sections, 5 tables)

This paper contains 22 sections, 5 tables.

Introduction
Creating Juhaina LLMs
Data Collection
Data Sources
Data Cleaning
Prompt Generation
Answer Generation
Postprocessing
Learnings on Data Collection
Post-training Procedure
Supervised Finetuning (SFT)
Alignment with Human Feedback
Infrastructure and Computing
Evaluation of Arabic LLMs
Open Arabic LLM Leaderboard (OALL)
...and 7 more sections
