ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering

Yuxiang Nie, Han Wang, Yongjie Ye, Haiyang Yu, Weitao Jia, Tao Zeng, Hao Feng, Xiang Fei, Yang Li, Xiaohui Lv, Guozhi Tang, Jingqun Tang, Jinghui Lu, Zehui Dai, Jiacong Wang, Dingkang Yang, An-Lan Wang, Can Huang

TL;DR

This work addresses the lack of Chinese-centric benchmarks for video understanding in Multimodal Large Language Models (MLLMs) by introducing ChineseVideoBench, a large-scale, manually annotated Chinese video QA benchmark. It comprises 1,625 CC0 videos across 11 domains and 6,507 multiple-choice questions spanning 8 task categories and 12 sub-tasks, with audio removed to emphasize visual content. The authors describe a three-stage annotation pipeline and a hierarchical task design that enables fine-grained diagnostic analysis. They evaluate both proprietary and open-source MLLMs and find that English-trained systems underperform on Chinese content, while culturally tuned models such as InternVL-38B approach the top closed-source performers but still struggle with temporal localization and fine-grained reasoning. The benchmark exposes gaps in current MLLMs’ temporal grounding and cultural-context understanding, which can guide future research toward more robust, culturally aware video understanding of Chinese content.

Abstract

This paper introduces ChineseVideoBench, a pioneering benchmark specifically designed for evaluating Multimodal Large Language Models (MLLMs) in Chinese Video Question Answering. The growing demand for sophisticated video analysis capabilities highlights the critical need for comprehensive, culturally aware evaluation frameworks. ChineseVideoBench addresses this gap by providing a robust dataset and tailored evaluation metrics, enabling rigorous assessment of state-of-the-art MLLMs on complex Chinese video content. Specifically, ChineseVideoBench comprises 8 main classes and 12 sub-classes, encompassing tasks that demand both deep video understanding and nuanced Chinese linguistic and cultural awareness. Our empirical evaluations reveal that ChineseVideoBench presents a significant challenge to current MLLMs. Among the models assessed, Gemini 2.5 Pro achieves the highest performance with an overall score of 77.9%, while InternVL-38B emerges as the most competitive open-source model.
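
As an illustration of how a multiple-choice benchmark of this kind is commonly scored, the sketch below computes per-task accuracy and a pooled overall accuracy. The record layout, the function name `score_predictions`, and the choice of a question-weighted average are assumptions made for this example; the summary above does not state how the reported overall score is aggregated. The task labels follow the abbreviations used in the Figure 1 caption.

```python
from collections import defaultdict


def score_predictions(records):
    """Compute per-task and pooled accuracy for multiple-choice answers.

    `records` is an iterable of (task, predicted_option, correct_option)
    tuples, e.g. ("TL", "B", "B"). This layout is hypothetical, not the
    benchmark's actual file format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        # Compare option letters case-insensitively, ignoring stray spaces.
        if predicted.strip().upper() == gold.strip().upper():
            correct[task] += 1

    per_task = {task: correct[task] / total[task] for task in total}
    # Assumption: overall accuracy pools all questions (micro average).
    overall = sum(correct.values()) / sum(total.values())
    return per_task, overall


# Toy records for illustration only; these are not benchmark results.
sample = [
    ("WK", "A", "A"),
    ("WK", "B", "C"),
    ("TL", "d", "D"),
]
per_task, overall = score_predictions(sample)
print(per_task, overall)
```

A task-averaged (macro) score would instead take the mean of the values in `per_task`; the two can differ noticeably when tasks have very different numbers of questions.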

Paper Structure

This paper contains 30 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Performance comparison of open-source and closed-source MLLMs on ChineseVideoBench across eight tasks: world knowledge (WK), topic recognition (TR), scene understanding (SU), character recognition (CR), temporal localization (TL), object perception (OP), action recognition (AR), and logical reasoning (LR).
  • Figure 2: Construction pipeline of ChineseVideoBench. We employ a multi-tier annotation process conducted entirely by human annotators to construct the benchmark.
  • Figure 3: Representative QA examples from different tasks. Each example displays selected video frames, Chinese QA pairs, and corresponding English translations. Correct answer options are highlighted in red.
  • Figure 4: Question distribution of ChineseVideoBench. Left: Question distribution of main aspects and tasks; Right: Question distribution of sub-tasks; Numbers in square brackets indicate the number of questions for each aspect, task, and sub-task.
  • Figure 5: Left: Distribution of video durations; Right: Distribution of token lengths for questions and options across tasks, tokenized using GPT-4o [openai2024gpt4o]. The tasks include world knowledge (WK), topic recognition (TR), scene understanding (SU), character recognition (CR), temporal localization (TL), object perception (OP), action recognition (AR), and logical reasoning (LR).
  • ...and 3 more figures