H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding

Qi Wu; Quanlong Zheng; Yanhao Zhang; Junlin Xie; Jinguo Luo; Kuo Wang; Peng Liu; Qingsong Xie; Ru Zhen; Zhenyu Yang; Haonan Lu

TL;DR

H2VU-Benchmark addresses gaps in video understanding evaluation with a hierarchical benchmark covering both offline and online (streaming) video. It spans long durations (up to 1.5 hours), adds countercommonsense reasoning and trajectory state tracking, and is complemented by a large corpus of first-person streaming videos. It comprises 10,183 tasks across 47 leaf abilities arranged in a three-tier hierarchy (L-1 to L-3), covering offline general and online streaming domains, and applies data curation intended to minimize prior-knowledge bias. Zero-shot evaluations of commercial and open-source MLLMs show that, while leading models achieve strong overall performance, state trajectory tracking and countercommonsense comprehension remain challenging, and online streaming scenarios require targeted training and architectural adaptations. The results point to a need to reduce prior-knowledge bias and to strengthen persistent perception in order to close the gap between offline and streaming video understanding; H2VU is offered as a practical resource to guide research toward real-world, real-time video assistants and agents.

Abstract

With the rapid development of multimodal models, the demand for assessing video understanding capabilities has steadily increased. However, existing video understanding benchmarks exhibit significant limitations in coverage, task diversity, and scene adaptability, which hinder accurate assessment of models' comprehensive video understanding capabilities. To address this, we propose a hierarchical and holistic video understanding (H2VU) benchmark designed to evaluate both general video and online streaming video comprehension. The benchmark has three key features: (1) Extended video duration: videos range from brief 3-second clips to 1.5-hour recordings, bridging the temporal gaps found in current benchmarks. (2) Comprehensive assessment tasks: beyond traditional perceptual and reasoning tasks, we introduce modules for countercommonsense comprehension and trajectory state tracking, which test models' deep understanding beyond mere prior knowledge. (3) Enriched video data: to keep pace with the rapid evolution of AI agents, we expand first-person streaming video datasets, enabling study of how multimodal models understand streaming video from a first-person perspective. Extensive results on H2VU reveal that existing multimodal large language models (MLLMs) have substantial room for improvement on our newly proposed evaluation tasks. We expect that H2VU will facilitate advances in video understanding research by offering a comprehensive and in-depth analysis of MLLMs.
