Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks

Fakhraddin Alwajih, El Moatez Billah Nagoudi, Gagan Bhatia, Abdelrahman Mohamed, Muhammad Abdul-Mageed

TL;DR

Peacock tackles the scarcity of Arabic multimodal resources by introducing two architecture variants (InstructBLIP-based and LLaVA-based) that fuse vision encoders with Arabic LLMs, backed by a two-stage training pipeline that translates and filters English image-text data for Arabic use. It also provides AraLLaMA, a high-quality Arabic-adapted LLaMA2-7B backbone, and introduces Henna, a culturally focused benchmark, plus an Egyptian-dialect case study to probe dialectal capabilities. Through comprehensive evaluations on VQA, LLaVA-Bench, SEED-Bench (Arabic), Henna, and dialect tasks, Peacock consistently outperforms multilingual baselines like mBlip, highlighting the impact of data quality, architecture choices, and Arabic-specific adaptation. The work establishes strong baselines and resources for Arabic vision-language modeling, enabling future research and culturally aware applications in the Arabic-speaking world.

Abstract

Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, success of MLLMs remains relatively limited to English-based settings. This poses significant challenges in developing comparable models for other languages, including even those with large speaker populations such as Arabic. To alleviate this challenge, we introduce a comprehensive family of Arabic MLLMs, dubbed Peacock, with strong vision and language capabilities. Through comprehensive qualitative and quantitative analysis, we demonstrate the solid performance of our models on various visual reasoning tasks and further show their emerging dialectal potential. Additionally, we introduce Henna, a new benchmark specifically designed for assessing MLLMs on aspects related to Arabic culture, setting the first stone for culturally-aware Arabic MLLMs. The GitHub repository for the Peacock project is available at https://github.com/UBC-NLP/peacock.

Paper Structure (34 sections, 21 figures, 7 tables)

This paper contains 34 sections, 21 figures, 7 tables.

Figures (21)

  • Figure 1: Comparison between the performance of Peacock and mBlip models on SEED-Benchmark dimensions.
  • Figure 2: Peacock InstructBLIP architecture: Integrates instruction-specific visual features using Q-Former and a frozen pretrained image encoder.
  • Figure 3: Peacock LLaVA architecture: Combines a pretrained frozen vision encoder with trained Arabic LLMs via an MLP bridge.
  • Figure 4: Our data filtering pipeline. After translating the data through Google Cloud API, we obtain the embeddings of both the original and translated samples using the multilingual sentence embedding model LaBSE. For each sample, we calculate the cosine similarity between the two extracted embeddings and reject samples whose similarity falls below an 80% threshold.
  • Figure 5: Examples of responses from Peacock and GPT-4V regarding an image related to Yemeni culture.
  • ...and 16 more figures
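
The filtering step described in the Figure 4 caption (embed the original and the translated sample, compare them by cosine similarity, reject anything under 0.80) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `cosine_similarity` and `filter_translations` are hypothetical, and the `embed` argument is a stand-in for whatever produces the sentence embeddings (LaBSE in the paper). The sketch also assumes that a sample scoring exactly at the threshold is kept, which the caption does not specify.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_translations(
    pairs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
    threshold: float = 0.80,
) -> List[Tuple[str, str, float]]:
    """Keep (source, translation) pairs whose embeddings are similar enough.

    `embed` is a hypothetical hook: in the paper this role is played by
    LaBSE; here any callable mapping a string to a vector will do.
    Pairs scoring below `threshold` are rejected (assumption: a score
    exactly equal to the threshold is kept).
    """
    kept = []
    for source, translation in pairs:
        score = cosine_similarity(embed(source), embed(translation))
        if score >= threshold:
            kept.append((source, translation, score))
    return kept


if __name__ == "__main__":
    # Toy, hand-written vectors standing in for real sentence embeddings.
    toy_vectors = {
        "a cat on a sofa": [1.0, 0.0],
        "قطة على أريكة": [0.9, 0.1],   # close to its source sentence
        "a red car": [0.0, 1.0],
        "طقس جميل": [1.0, 0.0],        # deliberately mismatched translation
    }
    pairs = [("a cat on a sofa", "قطة على أريكة"), ("a red car", "طقس جميل")]
    for source, translation, score in filter_translations(pairs, toy_vectors.__getitem__):
        print(f"kept: {source!r} -> {translation!r} (similarity {score:.3f})")
```

Passing the embedder in as a callable keeps the thresholding logic independent of any particular model, so the same function could be pointed at real LaBSE embeddings without changes to the filter itself.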