Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic

Fakhraddin Alwajih, Gagan Bhatia, Muhammad Abdul-Mageed

TL;DR

This work tackles the scarcity of high-quality multimodal data for Arabic and the challenge of dialect variation by introducing Dallah, a dialect-aware multimodal LLM built from LLaVA and AraLLaMA. It employs a translate-and-filter data pipeline to produce high-quality Arabic image-text data and constructs a six-dialect dataset to enable dialectal instruction-tuning across MSA and regional varieties. The model undergoes a three-stage training regimen (pre-training, visual instruction fine-tuning, and dialectal instruction-tuning) and is evaluated on both MSA and dialect-specific benchmarks, achieving state-of-the-art results and competitive human-aligned assessments. The work advances Arabic NLP by enabling robust, dialect-aware multimodal understanding and generation, with practical implications for education, culture, and inclusive AI; it also introduces Dallah-Bench and analyzes model-vs-human evaluation dynamics, outlining future improvements in cultural representation and evaluation metrics.

Abstract

Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.

Paper Structure

This paper contains 39 sections, 3 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Map highlighting the countries targeted by Dallah for dialectal Arabic dataset construction.
  • Figure 2: This figure illustrates the translation and filtering process used in constructing the Arabic dataset for Dallah. The red rows represent examples that were removed due to low similarity scores between the original English text and the back-translated English text. The green rows show the retained examples that met the similarity threshold, ensuring high-quality translations for effective model training.
  • Figure 3: Illustration of the translation and filtering process for constructing high-quality Arabic multimodal datasets. Examples illustrating the results of this pipeline are in Figure 2.
  • Figure 4: Dallah model architecture, showcasing the integration of the vision encoder, projector, and language model.
  • Figure 5: Training schema for Dallah, detailing the pre-training and visual instruction supervised fine-tuning phases.
  • ...and 7 more figures
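
The filtering step described in the Figure 2 caption (keep an example only when the back-translated English is similar enough to the original English) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The similarity function here is a simple word-overlap (Jaccard) score used as a stand-in, since the summary does not say how similarity is computed. The threshold of 0.8, the `Example` class, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    english: str          # original English text
    arabic: str           # machine-translated Arabic text
    back_translated: str  # Arabic text translated back into English


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity (an assumption): overlap of lowercase word sets, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_examples(examples, threshold=0.8, similarity=token_jaccard):
    """Split examples into (kept, removed) by back-translation similarity.

    The threshold is a hypothetical value chosen for illustration.
    """
    kept, removed = [], []
    for ex in examples:
        score = similarity(ex.english, ex.back_translated)
        (kept if score >= threshold else removed).append(ex)
    return kept, removed


# Toy data: the Arabic field holds placeholders, not real translations.
examples = [
    Example("a cat sits on the mat", "<arabic-1>", "a cat sits on the mat"),
    Example("a cat sits on the mat", "<arabic-2>", "the dog runs in a park"),
]
kept, removed = filter_examples(examples)
print(len(kept), "kept;", len(removed), "removed")
```

In this toy run the first example is retained (a "green row") because its back-translation matches the original exactly, and the second is dropped (a "red row") because the two sentences share only a couple of words. Any sentence-level similarity measure could be passed in through the `similarity` parameter in place of the word-overlap stand-in.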