H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding

Qi Wu, Quanlong Zheng, Yanhao Zhang, Junlin Xie, Jinguo Luo, Kuo Wang, Peng Liu, Qingsong Xie, Ru Zhen, Zhenyu Yang, Haonan Lu

TL;DR

H2VU-Benchmark addresses critical gaps in video understanding evaluation by introducing a hierarchical, offline-online benchmark that spans long durations (up to 1.5 hours) and adds countercommonsense reasoning and trajectory state tracking, complemented by a large corpus of first-person streaming videos. It combines 10,183 tasks across 47 leaf abilities arranged in a three-tier hierarchy (L-1 to L-3), covering offline general and online streaming domains, and includes rigorous data curation to minimize prior-knowledge bias. Extensive zero-shot evaluations across commercial and open-source MLLMs reveal that while leading models achieve strong overall performance, tasks like state trajectory tracking and countercommonsense comprehension remain challenging, and online streaming scenarios require targeted training and architectural adaptations. The results highlight a need to reduce reliance on prior knowledge and to enhance persistent perception in order to close the gap between offline and streaming video understanding, with H2VU as a practical resource to guide future research toward real-world, real-time video assistants and agents.

Abstract

With the rapid development of multimodal models, the demand for assessing video understanding capabilities has been steadily increasing. However, existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. These shortcomings hinder the accurate assessment of models' comprehensive video understanding capabilities. To tackle this challenge, we propose a hierarchical and holistic video understanding (H2VU) benchmark designed to evaluate both general video and online streaming video comprehension. This benchmark contributes three key features:

  • Extended video duration: Spanning videos from brief 3-second clips to comprehensive 1.5-hour recordings, thereby bridging the temporal gaps found in current benchmarks.
  • Comprehensive assessment tasks: Beyond traditional perceptual and reasoning tasks, we have introduced modules for countercommonsense comprehension and trajectory state tracking. These additions test the models' deep understanding capabilities beyond mere prior knowledge.
  • Enriched video data: To keep pace with the rapid evolution of current AI agents, we have expanded first-person streaming video datasets. This expansion allows for the exploration of multimodal models' performance in understanding streaming videos from a first-person perspective.

Extensive results from H2VU reveal that existing multimodal large language models (MLLMs) possess substantial potential for improvement in our newly proposed evaluation tasks. We expect that H2VU will facilitate advancements in video understanding research by offering a comprehensive and in-depth analysis of MLLMs.

Paper Structure

This paper contains 17 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Examples of each task in H²VU-Bench. Based on the video input and text prompt, multimodal large language models (MLLMs) are required to select the correct option.
  • Figure 2: (Left) Overview of ability dimensions in H²VU. Currently, H²VU incorporates three levels of ability dimensions (L-1 to L-3), encompassing 47 distinct leaf abilities. (Right) Video categories and video durations. H²VU covers 6 key domains, spans the full spectrum of video lengths, and covers different core abilities of MLLMs.
  • Figure 3: Comparison of commercial and open-source video large language models on the newly proposed tasks. The figure shows each model's average score on state trajectory tracking and countercommonsense comprehension.
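
The three-level ability hierarchy mentioned in the Figure 2 caption (L-1 to L-3, with leaf abilities at the bottom) suggests that multiple-choice results can be reported at any level by grouping questions under their ability path. The following is a minimal sketch of that idea, not the authors' evaluation code: the ability names and records are hypothetical placeholders, and per-question (micro) averaging is an assumption, since the text here does not say how scores are aggregated.

```python
from collections import defaultdict

# Hypothetical records: (L-1 ability, L-2 ability, L-3 leaf ability, correct?)
# The ability names are placeholders and are NOT taken from the benchmark.
results = [
    ("perception", "object", "counting", True),
    ("perception", "object", "counting", False),
    ("perception", "object", "attribute", True),
    ("reasoning", "temporal", "ordering", True),
    ("reasoning", "temporal", "ordering", True),
    ("reasoning", "causal", "explanation", False),
]


def accuracy_by_level(records, depth):
    """Return accuracy per ability path truncated to `depth` levels (1, 2, or 3).

    Assumes micro-averaging: every question counts equally within its group.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for *path, is_correct in records:
        key = tuple(path[:depth])
        total[key] += 1
        correct[key] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


leaf_scores = accuracy_by_level(results, 3)  # one score per L-3 leaf ability
top_scores = accuracy_by_level(results, 1)   # one score per L-1 ability
```

Keying each group by a tuple of ability names keeps leaves with the same name under different parents separate. A macro average (the mean of leaf scores within each parent) would be an equally plausible choice and can give different numbers when leaves contain unequal numbers of questions.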