AIN: The Arabic INclusive Large Multimodal Model

Ahmed Heakl, Sara Ghaboura, Omkar Thawkar, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan

TL;DR

This work tackles the under-explored space of Arabic-inclusive multimodal models by introducing AIN, a bilingual English–Arabic 7B-LMM built on the Qwen-2-VL-7B base and trained on 3.6 million Arabic–English multimodal samples. It demonstrates state-of-the-art Arabic performance across CAMEL-Bench tasks and robust cross-lingual English capabilities, validated by Arabic MMLU benchmarks and a multi-domain human evaluation showing strong user preference for AIN. The authors implement a rigorous data pipeline with translation evaluation, semantic verification via LaBSE, and toxicity filtering, culminating in a safe, high-quality 3.6M dataset for training. Overall, AIN represents a substantial step toward accessible, high-performing multimodal AI tools for Arabic speakers, with broad implications for cross-domain understanding, OCR, cultural insight, and domain-specific applications.
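The semantic verification step mentioned above can be illustrated with a minimal sketch. It assumes that each English sentence and its Arabic translation are mapped to vectors by some sentence encoder (LaBSE in the paper) and that a pair is kept only when the cosine similarity of the two vectors clears a cutoff. The function names, the `embed` callable, and the 0.8 threshold are hypothetical choices for illustration, not the authors' implementation or reported settings.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_translation_pairs(
    pairs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> List[Tuple[str, str]]:
    """Keep (english, arabic) pairs whose embeddings are similar enough."""
    kept = []
    for english, arabic in pairs:
        score = cosine_similarity(embed(english), embed(arabic))
        if score >= threshold:
            kept.append((english, arabic))
    return kept


if __name__ == "__main__":
    # Toy two-dimensional embeddings standing in for a real sentence encoder.
    toy_vectors = {
        "a cat": [1.0, 0.0],
        "قطة": [0.9, 0.1],
        "a dog": [1.0, 0.0],
        "سيارة": [0.0, 1.0],
    }
    candidates = [("a cat", "قطة"), ("a dog", "سيارة")]
    print(filter_translation_pairs(candidates, toy_vectors.__getitem__))
```

In a real pipeline, `embed` would wrap a multilingual encoder such as LaBSE, and the threshold would need to be tuned on held-out translation pairs.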

Abstract

Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN, the Arabic Inclusive Multimodal Model, designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM trained on 3.6 million carefully constructed, high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark, comprising 38 sub-domains including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, AIN demonstrates strong performance, with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.

Paper Structure

This paper contains 16 sections, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Cross-domain performance analysis on the CAMEL-Bench benchmark. Our AIN-7B achieves promising performance compared to significantly larger models (GPT-4o and Gemini-1.5-Pro) in both domain-specific and aggregate settings. Despite its smaller size, AIN-7B achieves competitive performance across all 38 sub-domains, with significantly superior capabilities on OCR & document understanding.
  • Figure 2: AIN: A versatile LMM excelling in visual and contextual understanding across diverse domains, including VQA on complex topics, OCR for various fonts and handwriting, cultural insights (traditions, food, places), agricultural tasks (crop identification, fruit classification, disease detection), remote sensing (multi-scale objects), medical imaging (various modalities), and video analysis (animation, human activities).
  • Figure 3: AIN compared to existing LMMs across the CAMEL-Bench benchmark [ghaboura2024camel] domains: OCR: "OCR & Document Understanding", Video: "General Video & Multi-Image Understanding", RS: "Remote Sensing Understanding", CDT: "Chart, Diagram & Table Understanding", Agro.: "Agricultural Image Understanding", Cultural: "Cultural-Specific Understanding", Medical: "Medical Image Understanding".
  • Figure 4: Qualitative results demonstrating AIN's comprehensive capabilities across diverse domains. The results show its proficiency in handling both multiple-choice and open-ended questions. Our proposed AIN exhibits robust performance in addressing queries related to visual attributes (shape, color, quantity), while maintaining appropriate response formats (single character, word, or complete sentence) according to task requirements.
  • Figure 5: Comparison of AIN with GPT-4o [gpt4o] and LLaVA [li2024llava] across diverse tasks. The evaluation demonstrates AIN's proficiency in handling both multiple-choice and open-ended questions while maintaining appropriate response formats.
  • ...and 11 more figures