Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic
Fakhraddin Alwajih, Gagan Bhatia, Muhammad Abdul-Mageed
TL;DR
This work tackles the scarcity of high-quality multimodal data for Arabic and the challenge of dialect variation by introducing Dallah, a dialect-aware multimodal LLM built from LLaVA and AraLLaMA. It employs a translate-and-filter data pipeline to produce high-quality Arabic image-text data and constructs a six-dialect dataset to enable dialectal instruction-tuning across MSA and regional varieties. The model undergoes a three-stage training regimen (pre-training, visual instruction fine-tuning, and dialectal instruction-tuning) and is evaluated on both MSA and dialect-specific benchmarks, achieving state-of-the-art results and competitive human-aligned assessments. The work advances Arabic NLP by enabling robust, dialect-aware multimodal understanding and generation, with practical implications for education, culture, and inclusive AI; it also introduces Dallah-Bench and analyzes model-vs-human evaluation dynamics, outlining future improvements in cultural representation and evaluation metrics.
Abstract
Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance among Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels on two benchmarks: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.
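
The translate-and-filter data pipeline mentioned in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: the round-trip (back-translation) check, the token-overlap similarity, the 0.8 threshold, and every function name below are assumptions chosen for illustration. The translators are toy lookup tables standing in for a real machine translation system, and a real pipeline would more plausibly score similarity with sentence embeddings.

```python
from typing import Callable, Iterable, List, Tuple


def token_jaccard(a: str, b: str) -> float:
    """Crude lexical similarity in [0, 1]; an assumed stand-in for an embedding-based score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def translate_and_filter(
    captions: Iterable[str],
    to_arabic: Callable[[str], str],
    to_english: Callable[[str], str],
    threshold: float = 0.8,  # assumed placeholder value, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Keep (english, arabic, score) triples whose round-trip similarity meets the threshold."""
    kept = []
    for en in captions:
        ar = to_arabic(en)      # forward translation
        back = to_english(ar)   # back-translation used only for quality scoring
        score = token_jaccard(en, back)
        if score >= threshold:
            kept.append((en, ar, score))
    return kept


# Toy stand-ins for a real MT system (hypothetical; for demonstration only).
_TOY_AR = {"a cat on a mat": "قطة على حصيرة", "a red car": "سيارة حمراء"}
_TOY_BACK = {"قطة على حصيرة": "a cat on a mat", "سيارة حمراء": "a crimson vehicle"}


def toy_to_arabic(text: str) -> str:
    return _TOY_AR[text]


def toy_to_english(text: str) -> str:
    return _TOY_BACK[text]


kept = translate_and_filter(list(_TOY_AR), toy_to_arabic, toy_to_english)
for en, ar, score in kept:
    print(f"{score:.2f}\t{en}\t{ar}")
```

In this toy run, the first caption survives because its back-translation matches the original exactly, while the second is dropped because the round trip drifts lexically. Passing the translators in as callables keeps the filtering logic independent of any particular translation backend.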
