VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension

Xinyu Chen, Yunxin Li, Haoyuan Shi, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang

TL;DR

VideoVista-CulturalLingo delivers the first video evaluation benchmark that jointly spans cultures, languages, and domains to assess multimodal video understanding. The authors introduce a hybrid automatic/human QA annotation pipeline across 14 tasks, covering Event, Object, Culture, and Science with 2 languages (Chinese and English) over 1,389 videos and 3,134 QA pairs. Comprehensive experiments on 24 LMMs reveal systematic weaknesses in open-source models for Chinese culture and math-related science questions, along with substantial temporal localization and cross-cultural generalization gaps versus proprietary models. The benchmark's scale, multilingual scope, and cross-domain coverage offer a rigorous, publicly usable platform to drive development of culturally aware, temporally adept video LMMs with real-world impact.

Abstract

Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. Our work differs from existing benchmarks in the following ways: 1) Cultural diversity, incorporating cultures from China, North America, and Europe; 2) Multilingual coverage, with questions presented in Chinese and English, two of the most widely spoken languages; and 3) Broad domain coverage, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source and proprietary large video models. From the experimental results, we observe that: 1) Existing models perform worse on Chinese-centric questions than on Western-centric ones, particularly questions related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, where the maximum score is only 45.2%; 3) Mainstream models perform strongly on general scientific questions, while open-source models perform weakly on mathematics.

Paper Structure

This paper contains 62 sections, 33 figures, 7 tables.

Figures (33)

  • Figure 1: An example of Chinese Culture in VideoVista-CulturalLingo. The correct answer is highlighted in yellow.
  • Figure 2: (Left) Comprehensive statistics from different perspectives. The reported durations are based on the 2,052 video clips. Question and answer lengths are counted in tokens; (Right) Videos in VideoVista-CulturalLingo are sourced from hundreds of domains across 3 popular video websites from around the world. For videos sourced from Xiaohongshu (RedNote), we present only 42 of the domains.
  • Figure 3: The three-stage annotation process of VideoVista-CulturalLingo.
  • Figure 4: LMM performance broken down by Culture, Language, and Duration. The Duration categories in (c) are: <2 minutes (Short), 2-10 minutes (Medium), >10 minutes (Long).
  • Figure 5: LMM performance broken down by domain for the 3 video sources. The models shown are Gemini-2.0-Flash, GPT-4o, Qwen2.5-VL-72B, VideoLLaMA3, InternVideo2.5, and MiniCPM-o 2.6. In Figures \ref{fig:sub_ytb} and \ref{fig:sub_xhs}, we present only the 18 domains with the most videos. In Figure \ref{fig:sub_bilibili}, we exclude domains containing fewer than 10 videos. Domains are labeled with abbreviations, as described in Appendix \ref{domain_abbreviations}.
  • ...and 28 more figures
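
The duration split named in the Figure 4 caption (Short, Medium, Long) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the record layout, and the choice to place videos of exactly 2 or 10 minutes in the Medium bucket are all assumptions made for illustration, since the caption does not specify how the boundaries are handled.

```python
from collections import defaultdict


def duration_bucket(seconds: float) -> str:
    """Map a video duration in seconds to Short / Medium / Long.

    Assumption: exactly 2 minutes and exactly 10 minutes fall in Medium.
    """
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 120:       # under 2 minutes
        return "Short"
    if seconds <= 600:      # 2 to 10 minutes, inclusive
        return "Medium"
    return "Long"           # over 10 minutes


def accuracy_by_bucket(records):
    """Compute accuracy per duration bucket.

    `records` is a hypothetical iterable of (duration_seconds, is_correct)
    pairs, one per answered question.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for seconds, is_correct in records:
        bucket = duration_bucket(seconds)
        totals[bucket] += 1
        hits[bucket] += int(bool(is_correct))
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```

Given per-question results with a duration attached, `accuracy_by_bucket` returns one accuracy value per bucket that actually occurs in the input. The same grouping pattern would apply to the Culture and Language splits by swapping the bucketing function.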