AIN: The Arabic INclusive Large Multimodal Model

Ahmed Heakl; Sara Ghaboura; Omkar Thawkar; Fahad Shahbaz Khan; Hisham Cholakkal; Rao Muhammad Anwer; Salman Khan

TL;DR

This work tackles the under-explored space of Arabic-inclusive multimodal models by introducing AIN, a bilingual English–Arabic 7B-LMM built on the Qwen-2-VL-7B base and trained on 3.6 million Arabic–English multimodal samples. It demonstrates state-of-the-art Arabic performance across CAMEL-Bench tasks and robust cross-lingual English capabilities, validated by Arabic MMLU benchmarks and a multi-domain human evaluation showing strong user preference for AIN. The authors implement a rigorous data pipeline with translation evaluation, semantic verification via LaBSE, and toxicity filtering, culminating in a safe, high-quality 3.6M dataset for training. Overall, AIN represents a substantial step toward accessible, high-performing multimodal AI tools for Arabic speakers, with broad implications for cross-domain understanding, OCR, cultural insight, and domain-specific applications.
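The semantic verification step can be illustrated with a minimal sketch. It assumes that each English sentence and its Arabic translation have already been embedded into a shared space (LaBSE produces such embeddings) and that a translated pair is kept only when the cosine similarity of the two vectors clears a threshold. The toy vectors, the `embed` lookup, the function names, and the 0.8 threshold are all hypothetical placeholders chosen for illustration; the summary above does not state the authors' actual threshold or implementation.

```python
import math

# Hypothetical stand-in for a LaBSE encoder: a lookup of precomputed toy vectors.
# A real pipeline would call a multilingual sentence encoder here instead.
TOY_EMBEDDINGS = {
    "a cat": [1.0, 0.0],
    "a car": [0.0, 1.0],
    "قطة": [0.9, 0.1],  # Arabic for "cat"
}


def embed(sentence):
    """Return the toy embedding for a sentence (assumption: precomputed)."""
    return TOY_EMBEDDINGS[sentence]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, threshold=0.8):
    """Keep (english, arabic) pairs whose embeddings are at least `threshold` similar."""
    kept = []
    for english, arabic in pairs:
        if cosine_similarity(embed(english), embed(arabic)) >= threshold:
            kept.append((english, arabic))
    return kept


candidate_pairs = [("a cat", "قطة"), ("a car", "قطة")]
verified_pairs = filter_pairs(candidate_pairs)
```

In this toy setup the faithful pair scores close to 1 and is kept, while the mismatched pair scores close to 0 and is dropped. Toxicity filtering would be a separate pass and is not shown here.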

Abstract

Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM that leverages 3.6 million carefully constructed, high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark, which comprises 38 sub-domains including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, the 7B AIN model outperforms GPT-4o by an absolute gain of 3.4%, averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.
