Multimodal Arabic Captioning with Interpretable Visual Concept Integration

Passant Elchafei, Amany Fashwan

TL;DR

VLCAP addresses the need for culturally aligned Arabic image captions by decoupling visual understanding from language generation. It grounds captions in interpretable Arabic visual concepts retrieved with CLIP-based encoders (mCLIP, AraCLIP, Jina V4) against an enriched Arabic visual vocabulary, then inserts those concepts into prompts that guide a vision-language model (Qwen-VL or Gemini Pro Vision). The study evaluates six encoder–decoder configurations: mCLIP + Gemini Pro Vision yields the best lexical and semantic scores (BLEU-1 and cosine similarity), while AraCLIP + Qwen-VL achieves the highest LLM-judge score. The authors report improved cultural relevance and interpretability, and present the framework as transferable to other low-resource languages.

Abstract

We present VLCAP, an Arabic image captioning framework that integrates CLIP-based visual label retrieval with multimodal text generation. Rather than relying solely on end-to-end captioning, VLCAP grounds generation in interpretable Arabic visual concepts extracted with three multilingual encoders, mCLIP, AraCLIP, and Jina V4, each evaluated separately for label retrieval. A hybrid vocabulary is built from training captions and enriched with about 21K general-domain labels translated from the Visual Genome dataset, covering objects, attributes, and scenes. The top-k retrieved labels are transformed into fluent Arabic prompts and passed along with the original image to vision-language models. In the second stage, we test Qwen-VL and Gemini Pro Vision for caption generation, resulting in six encoder–decoder configurations. The results show that mCLIP + Gemini Pro Vision achieves the best BLEU-1 (5.34%) and cosine similarity (60.01%), while AraCLIP + Qwen-VL obtains the highest LLM-judge score (36.33%). This interpretable pipeline enables culturally coherent and contextually accurate Arabic captions.
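
To make the first stage concrete, the following is a minimal sketch of top-k label retrieval by cosine similarity, followed by prompt construction. It is not the authors' implementation: the three-dimensional vectors stand in for embeddings that an encoder such as mCLIP, AraCLIP, or Jina V4 would produce, and the function names (`top_k_labels`, `build_prompt`), the toy labels, and the Arabic prompt wording are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_labels(image_vec, label_vecs, k=3):
    """Return the k labels whose embeddings are most similar to the image."""
    ranked = sorted(
        label_vecs.items(),
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


def build_prompt(labels):
    """Insert retrieved labels into a (hypothetical) Arabic prompt template."""
    # Roughly: "Write an Arabic description of the image using these concepts:"
    return "اكتب وصفًا عربيًا للصورة مستعينًا بالمفاهيم التالية: " + "، ".join(labels)


# Toy embeddings; a real system would obtain these from the encoder.
toy_label_vecs = {
    "قطة": [0.9, 0.1, 0.0],    # cat
    "أريكة": [0.7, 0.6, 0.1],  # sofa
    "سيارة": [0.0, 0.2, 0.9],  # car
}
toy_image_vec = [1.0, 0.3, 0.0]

labels = top_k_labels(toy_image_vec, toy_label_vecs, k=2)
prompt = build_prompt(labels)
```

In the second stage, `prompt` would be sent together with the original image to the vision-language model; that call is omitted here because it depends on the specific model API.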

Paper Structure

This paper contains 9 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: VLCAP system overview. The framework operates in two stages: (1) Arabic visual labels are retrieved by computing image–text similarity with a multilingual multimodal encoder (mCLIP, AraCLIP, or Jina V4) against a curated label vocabulary; (2) the retrieved labels are inserted into an Arabic prompt, which, together with the original image, is passed to a vision–language model (Qwen-VL or Gemini Pro Vision) to generate the final caption.
  • Figure 2: Visual Labels Vocab Builder. Construction of the Arabic visual label vocabulary: the most frequent content words are extracted from the training captions and augmented with general-domain visual concepts translated from the Visual Genome dataset, producing the final vocabulary used for label retrieval.
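
The vocabulary construction described for Figure 2 can be sketched roughly as follows. This is an illustrative simplification, not the authors' pipeline: whitespace tokenization and a hand-written stopword set stand in for proper Arabic content-word extraction, and `build_vocab`, `top_n`, and the toy captions and labels are hypothetical.

```python
from collections import Counter


def build_vocab(captions, stopwords, extra_labels, top_n=1000):
    """Merge frequent caption words with extra labels, without duplicates."""
    counts = Counter(
        word
        for caption in captions
        for word in caption.split()  # naive tokenization (assumption)
        if word not in stopwords
    )
    frequent = [word for word, _ in counts.most_common(top_n)]
    # dict.fromkeys keeps first-seen order while dropping repeated labels.
    return list(dict.fromkeys(frequent + list(extra_labels)))


toy_captions = ["قطة على أريكة", "قطة في حديقة"]  # "cat on sofa", "cat in garden"
toy_stopwords = {"على", "في"}                      # "on", "in"
toy_translated_labels = ["شجرة", "قطة"]            # "tree", "cat" (already present)

vocab = build_vocab(toy_captions, toy_stopwords, toy_translated_labels, top_n=3)
```

Here the translated labels play the role of the general-domain concepts taken from Visual Genome; a label that already appears among the caption words is kept only once.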