JEEM: Vision-Language Understanding in Four Arabic Dialects

Karima Kadaoui, Hanin Atwany, Hamdan Al-Ali, Abdelrahman Mohamed, Ali Mekky, Sergei Tilga, Natalia Fedorova, Ekaterina Artemova, Hanan Aldarmaki, Yova Kementchedjhieva

TL;DR

JEEM presents a culturally informed vision-language benchmark for four Arabic dialects (Jordanian, Egyptian, Emirati, and Moroccan) to evaluate image captioning and visual question answering. It details a rigorous data-collection pipeline with native-dialect annotation, dialect-first captions, and dialect-specific QA, complemented by cross-dialect shared content. The study benchmarks several Arabic VLMs and GPT-4o using traditional, GPT-based, and human evaluations, revealing persistent gaps in dialectal understanding and cultural grounding, even for strong models like GPT-4o. Key findings show open-source models lag behind GPT-4o in most metrics, with dialect-resource disparities (notably Emirati) driving performance differences, underscoring the need for more inclusive, dialect-aware training and evaluation. JEEM provides a framework for culturally diverse assessment and highlights practical implications for deploying VLMs in Arabic-speaking regions.

Abstract

We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, The Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, the model's linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally-diverse evaluation paradigms.

Paper Structure

This paper contains 46 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: A sample from JEEM (Moroccan set).
  • Figure 2: Dialectal coverage of JEEM. The country-level dialects used are shown in dark colors, with their respective region-level dialects in lighter colors. The regional classification follows the work of book-habash.
  • Figure 3: Topic distribution per dialect.
  • Figure 4: Image of an Omani Halwa (image sourced from the Emirati set) shared with annotators across all dialects. The Jordanian, Egyptian and Moroccan captions demonstrate an incorrect identification of the dessert and its components.
  • Figure 5: Distribution of annotators by number of tasks completed, for each of three tasks: Image Captioning, Question Writing, and Answer Writing. Each bar represents the number of writers contributing within a given range, with colors indicating different dialects. Y-axis: number of unique writers. X-axis: the number of tasks grouped into intervals.
  • ...and 6 more figures