MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding

Purui Bai, Tao Wu, Jiayang Sun, Xinyue Liu, Huaibo Huang, Ran He

Abstract

The rapid progress of Large Language Models (LLMs) has spurred growing interest in Multi-modal LLMs (MLLMs) and motivated the development of benchmarks to evaluate their perceptual and comprehension abilities. Existing benchmarks, however, are limited to static images or single videos, overlooking the complex interactions across multiple videos. To address this gap, we introduce the Multi-Video Perception Evaluation Benchmark (MVPBench), a new benchmark featuring 14 subtasks across diverse visual domains designed to evaluate models on extracting relevant information from video sequences to make informed decisions. MVPBench includes 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips. Extensive evaluations reveal that current models struggle to process multi-video inputs effectively, underscoring substantial limitations in their multi-video comprehension. We anticipate MVPBench will drive advancements in multi-video perception.

Paper Structure (15 sections, 4 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Challenges in managing multi-video inputs. Previous benchmarks are primarily designed for straightforward visual-text problems, such as describing the content and associations of static images. While video description adds information that enriches content descriptions, effectively comparing multiple related videos remains challenging.
  • Figure 2: Design principles of MVPBench. A key feature of MVPBench is its ability to consider both the multiplicity of evaluation inputs and the temporal characteristics of evaluation videos.
  • Figure 3: Statistics of MVPBench. The benchmark includes 14 tasks in 5 domains, ranging from low-level pattern comparison to mid-level temporal logic reasoning, and extending to high-level visual content understanding.
  • Figure 4: Task introduction of MVPBench. Zoom in for a better view.
  • Figure 5: Additional visualizations of data in MVPBench. Examples of Temporal Segment Splicing tasks and Content Assessment tasks.
  • ...and 2 more figures