
T*: Re-thinking Temporal Search for Long-Form Video Understanding

Jinhui Ye, Zihan Wang, Haosen Sun, Keshigeyan Chandrasegaran, Zane Durante, Cristobal Eyzaguirre, Yonatan Bisk, Juan Carlos Niebles, Ehsan Adeli, Li Fei-Fei, Jiajun Wu, Manling Li

TL;DR

This work tackles the bottleneck of long-form video understanding by reframing temporal search as a spatial search problem and introducing LV-Haystack, a real-world benchmark with 15,092 QA instances across Ego4D and LongVideoBench. The authors propose T*, an iterative, zooming-in search framework that grounds questions, searches keyframes efficiently, and then feeds selected frames to vision-language models for downstream QA. Empirical results show substantial improvements in QA accuracy and efficiency across multiple VLMs (e.g., GPT-4o and LLaVA-OneVision-72B) with far fewer frames and reduced computational costs, especially on longer videos. The LV-Haystack benchmark and the T* framework together enable more scalable, interpretable, and resource-efficient long-form video understanding, with broad implications for video QA, indexing, and real-time analysis.

Abstract

Efficiently understanding long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and address a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are twofold. First, we frame temporal search as a Long Video Haystack problem: finding a minimal set of relevant frames (e.g., one to five) from tens of thousands based on specific queries. Upon this formulation, we introduce LV-Haystack, the first dataset with 480 hours of video and 15,092 human-annotated instances for both training and evaluation, aiming to improve temporal search quality and efficiency. Results on LV-Haystack highlight a significant research gap in temporal search capabilities, with current SOTA search methods achieving only a 2.1% temporal F1 score on the LongVideoBench subset. Second, inspired by visual search in images, we propose a lightweight temporal search framework, T*, which reframes costly temporal search as spatial search. T* leverages powerful visual localization techniques commonly used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Extensive experiments show that integrating T* with existing methods significantly improves SOTA long-form video understanding. Under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on the LongVideoBench XL subset. Our code, benchmark, and models are provided in the supplementary material.

Paper Structure

This paper contains 64 sections, 16 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Long-form video understanding performance comparison on the LongVideoBench (Wu et al., 2024) XL subset (900-3600s). Open-source model size is indicated by marker size. Our lightweight temporal search algorithm T* (described in the framework overview section) improves SOTA models significantly: GPT-4o (50.5% $\rightarrow$ 53.1%) and LLaVA-OneVision-72B (56.5% $\rightarrow$ 62.4%), both with 32 frames.
  • Figure 2: The T* framework, which employs efficient temporal search for long-form video understanding. T* uses an iterative temporal search approach to find the keyframes essential to answering questions. Left: Question Grounding, where a visual language model identifies visual cues (target and cue objects) from the textual question. Center: Iterative Temporal Search, formulated as Spatial Search, where a spatial search model iteratively detects visual cues and upsamples relevant temporal/visual regions. Right: Downstream Task, where the visual language model answers questions using $K$ keyframes sampled from the final temporal search distribution as visual input.
  • Figure 3: Sampling weight dynamics over iterations for example videos. Ground truth frames are marked in red. Sampling weights progressively concentrate on ground truth frames across iterations (1, 11, and 21), indicating improved alignment with keyframes over time. Notably, because sampling in temporal search is efficient, our model can simultaneously zoom in on keyframes that are far apart in time (e.g., around 50s and 100s in the top plot).
  • Figure 4: Performance improvement with increasing search frames. T* consistently enhances accuracy and reaches near-human oracle performance at 64 frames.
  • Figure 5: Grid Size Impact on Search Performance. The red line represents the average number of search iterations for different image grid configurations, while the blue line shows the performance on the LongVideoBench (Wu et al., 2024) XL subset using 8 frames and LLaVA-72B as the downstream QA model.
  • ...and 8 more figures
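
The iterative search described in the Figure 2 and Figure 3 captions can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes a sampling distribution over frames that is sharpened around frames where a detector fires. The function name `temporal_search`, the `detect` callback (a stand-in for the spatial search model), and the `spread` and `boost` parameters are all hypothetical, and the final step picks the top-scoring visited frames instead of sampling from the final distribution as the caption describes.

```python
import random


def temporal_search(num_frames, detect, k=8, grid=4, iterations=20,
                    spread=5, boost=4.0, seed=0):
    """Toy iterative temporal search over frame indices.

    detect(frame_index) -> score in [0, 1]. Hypothetical stand-in for a
    spatial search model (e.g. an object detector) run on a sampled frame.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_frames   # start from a uniform sampling distribution
    scores = [0.0] * num_frames    # best detection score seen for each frame
    per_iter = grid * grid         # frames that would be tiled into one grid image

    for _ in range(iterations):
        # Sample frames in proportion to the current weights.
        sampled = rng.choices(range(num_frames), weights=weights, k=per_iter)
        for idx in sorted(set(sampled)):
            s = detect(idx)
            scores[idx] = max(scores[idx], s)
            # "Zoom in": raise the weight of the frame and its temporal
            # neighbours in proportion to the detection score.
            lo = max(0, idx - spread)
            hi = min(num_frames, idx + spread + 1)
            for j in range(lo, hi):
                weights[j] += boost * s

    # Simplification: keep the k best-scoring frames, returned in time order.
    top = sorted(range(num_frames), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top), weights


if __name__ == "__main__":
    # Toy detector: the queried object is visible only in frames 400-599.
    def toy_detect(i):
        return 1.0 if 400 <= i < 600 else 0.0

    keyframes, final_weights = temporal_search(1000, toy_detect, k=4)
    print(keyframes)
```

Because weight is added only around frames where the detector responds, later iterations draw more samples from those regions, which is the qualitative behaviour the Figure 3 caption reports. Several distant regions can gain weight at once, since each sampled frame updates only its own neighbourhood.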