A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin'ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan

TL;DR

ViMUL addresses the gap in multilingual, culturally aware video understanding by introducing ViMUL-Bench, a benchmark spanning 14 languages and 15 domains with 8K native-verified QA pairs across generic and cultural content. It also presents ViMUL, a multilingual video LMM built on a SigLIP-based vision encoder and a Qwen-2.0 language model, trained via a large-scale multilingual instruction-tuning dataset (1,238,102 translated samples) and evaluated with cycle-consistency checks. Results show that while closed-source GPT-4o dominates, ViMUL offers a competitive and more balanced performance across high- and low-resource languages, particularly on long, culturally nuanced videos. The work provides publicly available benchmarks, datasets, and a straightforward multilingual baseline to accelerate inclusive video-language research across languages and cultures.

Abstract

Large multimodal models (LMMs) have recently gained attention for their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs operate only in English. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond English for cultural and linguistic inclusivity has yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals, and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long), with 8k samples that are manually verified by native language speakers. In addition, we introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope that ViMUL-Bench and our multilingual video LMM, along with the large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM, and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.

Paper Structure

This paper contains 44 sections, 1 equation, 21 figures, 5 tables.

Figures (21)

  • Figure 1: ViMUL-Bench consists of carefully curated videos spanning 14 languages, with 8K manually verified annotations by native experts. It covers 15 diverse domains, incorporating real-world cultural elements such as regional landmarks, local cuisines, and traditional festivals. Additionally, we introduce ViMUL, a simple multilingual baseline designed for general and cultural video comprehension. Qualitative examples (top: Sinhala; bottom: Bengali) show that ViMUL performs favorably against recent video LMMs in cultural inclusivity and overall understanding (errors are highlighted in red and correct answers in green). ViMUL-Bench covers diverse question types, such as MCQs and short and long visual question answers (VQAs). Models shown: ViLA, Video-Chat2, Video-ChatGPT, LLaVA-OneVision-Qwen (OQ), LLaVA-Next (LN), and our ViMUL.
  • Figure 2: Benchmarking video LMMs on the proposed ViMUL-Bench across various languages and cultures. (a) Performance comparison of open-source versus closed-source models, with a distinction between low-resource and high-resource languages in our ViMUL-Bench. (b) Performance of different video LMMs across 15 diverse categories (both generic and cultural) in our ViMUL-Bench. Categories in black are generic; categories in blue are cultural.
  • Figure 3: Data collection and verification pipeline. Our benchmark consists of both culture-specific video content curated from scratch (left) and generic Video-QA pairs sourced from existing video LMM benchmarks. Cultural videos are scraped using a (country, language, sub-topic) triplet and manually filtered for relevance and private information. With the help of native speakers, we create QA pairs for each language from scratch (except English), with cultural QA pairs translated into English using GPT-4o. Our ViMUL-Bench has diverse question types and features approximately 8K QA pairs in 14 languages.
  • Figure 4: Overview of ViMUL. ViMUL is designed to comprehend and generate content in 14 different languages: Arabic, Bengali, Chinese, English, French, German, Hindi, Japanese, Russian, Sinhala, Spanish, Swedish, Tamil, and Urdu, covering at least two-thirds of the global population. The model employs a vision encoder to process video frames, followed by a vision-to-language projector and an LLM. The projected features are then concatenated with the user query and fed into the LLM to generate a response. Icons in the figure distinguish frozen from trained components.
  • Figure 5: Performance comparison of video LMMs across 14 languages on ViMUL-Bench. Average accuracy is reported across all question types for each language. Each box represents a model’s accuracy for a specific language, with darker shades indicating higher accuracy. The results show that the closed-source model, GPT-4o, generally outperforms its open-source counterparts. Compared with high-resource languages, all methods struggle on low-resource languages (e.g., Sinhala, Urdu, Tamil). Among open-source models, our ViMUL provides a better tradeoff between high- and low-resource languages, achieving an overall gain of 2% over LLaVA-OneVision.
  • ...and 16 more figures
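
The per-language averaging described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record format, the function names, and the choice to pool all question types before averaging are assumptions, and the toy records are invented placeholders rather than benchmark results. The low-resource set lists only the three languages the caption names as examples; the paper's full high/low split is not given here.

```python
from collections import defaultdict

# Assumption: only the three low-resource examples named in the Figure 5
# caption are listed; the full split used in the paper is not shown here.
LOW_RESOURCE_EXAMPLES = {"Sinhala", "Urdu", "Tamil"}


def per_language_accuracy(records):
    """Pool all question types and return accuracy per language.

    records: iterable of (language, question_type, is_correct) tuples
    (a hypothetical format chosen for this sketch).
    """
    totals = defaultdict(lambda: [0, 0])  # language -> [num_correct, num_total]
    for language, _question_type, is_correct in records:
        totals[language][0] += int(is_correct)
        totals[language][1] += 1
    return {lang: correct / total for lang, (correct, total) in totals.items()}


def group_mean(accuracy, languages):
    """Unweighted mean accuracy over the given languages that have results."""
    values = [accuracy[lang] for lang in languages if lang in accuracy]
    return sum(values) / len(values) if values else float("nan")


# Toy placeholder records, invented for illustration only.
toy_records = [
    ("English", "mcq", True),
    ("English", "short", True),
    ("English", "long", False),
    ("English", "mcq", True),
    ("Sinhala", "mcq", True),
    ("Sinhala", "short", False),
    ("Urdu", "mcq", False),
    ("Urdu", "long", True),
]

accuracy = per_language_accuracy(toy_records)
low_mean = group_mean(accuracy, LOW_RESOURCE_EXAMPLES)
other_mean = group_mean(accuracy, set(accuracy) - LOW_RESOURCE_EXAMPLES)
```

Reporting the two group means side by side is one simple way to read the "tradeoff between high- and low-resource languages" that the caption refers to: a model with a smaller gap between the groups is more balanced, independent of its overall average.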