ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark

Sara Ghaboura, Ketan More, Wafa Alghallabi, Omkar Thawakar, Jorma Laaksonen, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

TL;DR

ARB addresses the underrepresentation of Arabic in multimodal reasoning by delivering the first Arabic-centric benchmark for step-by-step reasoning across 11 domains, with 1,356 samples and 5,119 reasoning steps. It employs a multi-source data pipeline (translations, Arabic QA, synthetic content, tool-augmented data) refined via human-in-the-loop feedback and native-speaker validation, plus an LLM-based judgment framework to assess reasoning quality. Evaluations on 12 open- and closed-source large multimodal models reveal a consistent gap between reasoning coherence and final-answer correctness, underscoring the need for Arabic-specific, interpretability-focused benchmarks. The paper provides an open-source dataset, evaluation rubric, and tooling to support reproducibility and future research, advancing Arabic-native, culturally grounded multimodal AI. ARB thus offers a rigorous, reproducible platform for diagnosing and improving Arabic multimodal reasoning.

Abstract

As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most benchmarks remain focused on English, overlooking languages with rich linguistic and cultural contexts, such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB spans 11 diverse domains, including visual reasoning, document understanding, OCR, scientific analysis, and cultural interpretation. It comprises 1,356 multimodal samples paired with 5,119 human-curated reasoning steps and corresponding actions. We evaluated 12 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB offers a structured framework for diagnosing multimodal reasoning in underrepresented languages and marks a critical step toward inclusive, transparent, and culturally aware AI systems. We release the benchmark, rubric, and evaluation suite to support future research and reproducibility. Code available at: https://github.com/mbzuai-oryx/ARB

Paper Structure

This paper contains 25 sections, 22 figures, 5 tables.

Figures (22)

  • Figure 1: ARB Dataset Diversity. ARB comprises a wide array of multimodal reasoning samples, each combining a visual input with an Arabic question and detailed step-by-step reasoning, with the action taken at each step. The dataset spans 11 distinct domains, including visual reasoning, OCR and document understanding, chart and diagram interpretation, mathematical and logical inference, scientific and medical analysis, cultural and historical interpretation, remote sensing, agricultural image analysis, and complex visual perception, capturing the linguistic richness, cultural depth, and cross-domain complexity essential for evaluating reasoning in Arabic.
  • Figure 2: The ARB Dataset Pipeline. The figure illustrates the ARB pipeline for evaluating Arabic multimodal reasoning in LMMs. It begins with data collection across 11 domains—such as medical imaging, historical interpretation, visual reasoning, and agriculture—sourced from curated datasets (e.g., VRC-Bench, CAMEL-Bench), synthetic content, tool-augmented outputs, and web scraping. Data is generated across five categories: English reasoning chains, Arabic Q&A, English captions, synthetic samples, and tool-enhanced content. Reasoning steps are refined via human-in-the-loop feedback and filtered for logical consistency and cultural alignment. The benchmark supports fine-grained evaluation of open- and closed-source models on Arabic step-by-step reasoning.
  • Figure 3: Overview of the ARB Data Collection, Generation and Verification Framework. The ARB benchmark is constructed from five primary data sources: (1) English reasoning benchmarks, (2) Arabic question–answer benchmarks, (3) English-captioned datasets, (4) Synthetic data, and (5) Tool-augmented data. All data undergoes iterative refinement through human-in-the-loop feedback and validation by native Arabic speakers to ensure cultural and linguistic fidelity.
  • Figure 4: ARB Evaluation Prompt. The standardized Arabic prompt used across all ARB domains to elicit structured, curriculum-based reasoning steps from evaluated models during inference. The English version is provided in the appendix.
  • Figure 5: Arabic Reasoning Evaluation Metrics. We assess step-by-step reasoning using five core Arabic-specific dimensions: Faithfulness (At-Tatābuq), Informativeness (Al-Ithrā' Al-Ma'lūmātī), Coherence (At-Tawāfuq), Commonsense (Al-Mantiq Al-'Āmm), and Reasoning Alignment (At-Tawāfuq Al-Istidlālī). Auxiliary checks cover hallucinations, redundancy, semantic gaps, and missing steps. Metrics are defined at the step and/or token level. The full evaluation rubric is provided in English in the appendix.
  • ...and 17 more figures
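
To make the step-level metrics in Figure 5 concrete, the sketch below shows one way per-step scores on the five core dimensions could be combined into per-dimension and overall scores. It is illustrative only: the `StepScore` class, the `aggregate_scores` function, the 0 to 1 score scale, and the equal-weight averaging are all assumptions made here, not the paper's rubric or the ARB repository's code.

```python
from dataclasses import dataclass
from statistics import mean

# The five core dimensions named in Figure 5 (identifier spellings are assumed).
DIMENSIONS = (
    "faithfulness",
    "informativeness",
    "coherence",
    "commonsense",
    "reasoning_alignment",
)


@dataclass
class StepScore:
    """Hypothetical judge output for one reasoning step."""
    step_index: int
    scores: dict  # dimension name -> score in [0, 1] (assumed scale)


def aggregate_scores(step_scores):
    """Average each dimension over all steps, then average the dimensions.

    Equal weighting is an assumption for illustration; the actual rubric
    may weight dimensions or steps differently.
    """
    if not step_scores:
        raise ValueError("at least one scored step is required")
    per_dimension = {
        dim: mean([step.scores[dim] for step in step_scores])
        for dim in DIMENSIONS
    }
    overall = mean(list(per_dimension.values()))
    return per_dimension, overall


if __name__ == "__main__":
    # Two made-up steps: the first judged strong, the second weak on coherence.
    steps = [
        StepScore(0, {dim: 1.0 for dim in DIMENSIONS}),
        StepScore(1, {**{dim: 1.0 for dim in DIMENSIONS}, "coherence": 0.0}),
    ]
    per_dimension, overall = aggregate_scores(steps)
    print(per_dimension, overall)
```

Keeping the per-dimension averages separate from the overall score makes it possible to see cases where a model's reasoning is coherent but its other scores are low, which is the kind of gap the evaluation is meant to expose.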