HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

Heqing Zou, Tianze Luo, Guiyang Xie, Victor Xiao Jie Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang

TL;DR

This work tackles hour-long video understanding, a regime underserved by existing benchmarks because of long-term dependency challenges and computational demands. It introduces HLV-1K, a large-scale benchmark comprising 1,009 hour-long videos with 14,847 time-aligned QA/MCQA pairs spanning frame-level to long-term reasoning. The authors detail a four-stage data construction and labeling pipeline (dense frame extraction, object-aware frame descriptions, sliding-window event labeling, and time-stamped QA/MCQA generation), followed by data filtering. Experimental results with commercial and open-source multimodal LLMs reveal strengths in specialized 72B models but also clear gaps in long-term temporal understanding, highlighting the benchmark’s potential to drive advances in time-specific long-video understanding and related applications.
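
As a rough illustration of the sliding-window step mentioned above, the following Python sketch groups time-stamped frame descriptions into overlapping windows that an event labeler could then summarize. It is a minimal sketch under stated assumptions, not the authors' implementation: the `FrameDescription` type, the `sliding_windows` helper, and the 60-second window and 30-second stride are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class FrameDescription:
    """Hypothetical record: one captioned frame with its position in the video."""
    timestamp_s: float  # seconds from the start of the video
    text: str           # description of that frame


def sliding_windows(
    frames: List[FrameDescription],
    window_s: float = 60.0,  # assumed window length, not taken from the paper
    stride_s: float = 30.0,  # assumed stride, not taken from the paper
) -> Iterator[List[FrameDescription]]:
    """Yield overlapping, time-ordered groups of frame descriptions."""
    if window_s <= 0 or stride_s <= 0:
        raise ValueError("window_s and stride_s must be positive")
    if not frames:
        return
    ordered = sorted(frames, key=lambda f: f.timestamp_s)
    last_timestamp = ordered[-1].timestamp_s
    start = ordered[0].timestamp_s
    while start <= last_timestamp:
        # Keep every frame whose timestamp falls in [start, start + window_s).
        window = [f for f in ordered if start <= f.timestamp_s < start + window_s]
        if window:
            yield window
        start += stride_s
```

Each yielded window carries the timestamps of its frames, so any event label or question derived from it can keep a time range attached, which is the property the time-stamped QA/MCQA pairs rely on.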

Abstract

Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, understanding hour-long videos, which span over one hour and contain tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) the lack of large-scale benchmark datasets. Among these, this paper focuses on building a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1,009 hour-long videos with 14,847 high-quality question answering (QA) and multiple-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.

Paper Structure (17 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: HLV-1K statistics: (a) Video category distribution, (b) Video duration distribution, and (c) Duration distribution of time-specific queries.
  • Figure 2: Construction of HLV-1K: (a) HLV-1K construction pipeline with data collection, data labeling, and data filtering and revision, (b) Example of a QA sample in HLV-1K, and (c) Example of an MCQA sample in HLV-1K.
  • Figure 3: Distribution of benchmark annotations.
  • Figure 4: Long video understanding evaluation results on HLV-1K under different QA tasks.