TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding

Junwen Pan, Rui Zhang, Xin Wan, Yuan Zhang, Ming Lu, Qi She

TL;DR

Long videos challenge LVLM-based understanding due to computation and frame-downsampling artifacts. The authors propose ZoomV, a hierarchical temporal search framework with TemporaLink (temporal-augmented representations) and TemporaLight (self-reflection-guided spotlighting) to identify and validate query-relevant moments, producing compact event representations. Across eight tasks and multiple backbones, ZoomV yields state-of-the-art results on hour-long benchmarks and strong improvements in temporal grounding, including notable VQA gains on ReXTime. The work demonstrates that self-reflection and coarse-to-fine temporal search can unlock robust, interpretable long-video understanding in end-to-end LVLM systems.

Abstract

Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. Motivated by human hierarchical temporal search strategies, we propose TimeSearch, a novel framework enabling LVLMs to understand long videos in a human-like manner. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM: 1) Spotlight efficiently identifies relevant temporal events through a Temporal-Augmented Frame Representation (TAFR), explicitly binding visual features with timestamps; 2) Reflection evaluates the correctness of the identified events, leveraging the inherent temporal self-reflection capabilities of LVLMs. TimeSearch progressively explores key events and prioritizes temporal search based on reflection confidence. Extensive experiments on challenging long-video benchmarks confirm that TimeSearch substantially surpasses previous state-of-the-art, improving the accuracy from 41.8% to 51.5% on the LVBench. Additionally, experiments on temporal grounding demonstrate that appropriate TAFR is adequate to effectively stimulate the surprising temporal grounding ability of LVLMs in a simpler yet versatile manner, which improves mIoU on Charades-STA by 11.8%. The code will be released.
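The search loop described in the abstract (spotlight candidate events, score them by reflection, explore the most confident ones first) can be illustrated with a small sketch. This is not the authors' implementation: the function `hierarchical_search`, the callables `spotlight` and `reflect`, and the way the thresholds `eps` and `delta` are used (reject candidates below `eps`, stop refining segments no longer than `delta` seconds) are all assumptions made for illustration. In the paper both primitives are carried out by the LVLM itself; here they are stand-in Python functions.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def hierarchical_search(
    duration: float,
    spotlight: Callable[[Segment], List[Segment]],  # hypothetical: proposes sub-events
    reflect: Callable[[Segment], float],            # hypothetical: confidence in [0, 1]
    eps: float = 0.5,      # assumed role: minimum reflection confidence to keep a candidate
    delta: float = 60.0,   # assumed role: segments this short are not refined further
    max_steps: int = 32,   # budget on the number of segments expanded
) -> List[Tuple[Segment, float]]:
    """Return accepted (segment, confidence) pairs, most confident first."""
    # heapq is a min-heap, so confidences are negated to pop the best segment first.
    frontier: List[Tuple[float, Segment]] = [(-1.0, (0.0, duration))]
    accepted: List[Tuple[Segment, float]] = []
    steps = 0
    while frontier and steps < max_steps:
        _, seg = heapq.heappop(frontier)
        steps += 1
        for cand in spotlight(seg):
            conf = reflect(cand)
            if conf < eps:
                continue  # reflection rejects this candidate
            if cand[1] - cand[0] <= delta:
                accepted.append((cand, conf))  # fine enough: keep as a key event
            else:
                heapq.heappush(frontier, (-conf, cand))  # refine later, by priority
    accepted.sort(key=lambda item: -item[1])
    return accepted


# Toy stand-ins: split each segment in half, and trust only the half
# that contains a known target timestamp (3000 s into a one-hour video).
TARGET = 3000.0


def toy_spotlight(seg: Segment) -> List[Segment]:
    mid = (seg[0] + seg[1]) / 2
    return [(seg[0], mid), (mid, seg[1])]


def toy_reflect(seg: Segment) -> float:
    return 1.0 if seg[0] <= TARGET < seg[1] else 0.0


result = hierarchical_search(3600.0, toy_spotlight, toy_reflect)
print(result)
```

With these toy stand-ins the search narrows an hour-long timeline to a single sub-minute window around the target while expanding only a handful of segments, which is the coarse-to-fine behaviour the abstract describes.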

Paper Structure

This paper contains 24 sections, 5 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustration of human-like interaction for long-video understanding. It divides hour-long videos into manageable sub-events and searches within query-aware segments.
  • Figure 1: Effectiveness and efficiency trade-off with confidence threshold $\epsilon$ and sub-event duration threshold $\Delta$.
  • Figure 2: An illustrative view of ZoomV. Equipped with ZoomV, Video LLM can gain enhanced capability for efficient and accurate long-video understanding.
  • Figure 2: Illustration of the subtle temporal dynamic challenge.
  • Figure 3: Illustration of TemporaLink.
  • ...and 7 more figures