VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding

Shihao Wang, Guo Chen, De-an Huang, Zhiqi Li, Minghan Li, Guilin Li, Jose M. Alvarez, Lei Zhang, Zhiding Yu

TL;DR

VideoITG introduces instruction-aligned frame sampling for Video-LLMs via the VidThinker annotation pipeline, producing the VideoITG-40K dataset with 40K videos and 500K grounding annotations. It then offers plug-and-play VideoITG models with three grounding variants, demonstrating consistent improvements on long-video benchmarks by aligning frame selection with user instructions. The results highlight the importance of instruction-guided temporal reasoning and the potential of video-language alignment to enhance long-video understanding. The work provides a scalable framework for efficient, targeted frame selection that can outperform larger models using naive sampling strategies.

Abstract

Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, largely adopt unsupervised learning paradigms and struggle to address the complex scenarios in long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of the visual-language alignment and reasoning capabilities of Video-LLMs, for effective frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, showing its superiority and great potential for video understanding.
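
The three-stage flow described in the abstract (caption clips, retrieve the relevant ones, then pick frames inside them) can be illustrated with a toy sketch. This is not the authors' implementation: the real pipeline relies on language models for captioning and retrieval, whereas here the captions are written by hand and retrieval is replaced by simple keyword overlap. The names `Clip`, `retrieve_clips`, `select_frames`, and the `STOPWORDS` set are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Hypothetical stop-word list so that filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "what", "does", "is", "of"}


@dataclass
class Clip:
    start: int    # first frame index (inclusive)
    end: int      # last frame index (exclusive)
    caption: str  # stage 1 output; written by hand in this toy example


def _keywords(text):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def retrieve_clips(clips, instruction, min_overlap=1):
    """Stage 2 stand-in: keep clips whose caption shares keywords with the instruction."""
    query = _keywords(instruction)
    return [c for c in clips if len(_keywords(c.caption) & query) >= min_overlap]


def select_frames(clips, per_clip=2):
    """Stage 3 stand-in: take evenly spaced frame indices inside each retrieved clip."""
    frames = []
    for c in clips:
        step = max((c.end - c.start) // per_clip, 1)
        frames.extend(range(c.start, c.end, step)[:per_clip])
    return frames


clips = [
    Clip(0, 10, "a man opens the fridge"),
    Clip(10, 20, "a dog runs outside"),
    Clip(20, 30, "the man pours milk"),
]
relevant = retrieve_clips(clips, "what does the man pour")
print(select_frames(relevant))  # frames drawn only from the two clips mentioning "man"
```

The point of the sketch is the ordering: frame selection runs only on clips that survived instruction-guided retrieval, so the frame budget is spent on instruction-relevant parts of the video rather than spread uniformly.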

Paper Structure

This paper contains 21 sections, 5 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the VidThinker annotation pipeline for VideoITG. The pipeline consists of three stages that fully leverage the provided instructions: (1) segment-level clip captioning; (2) instruction-guided relevant clip retrieval; (3) fine-grained frame-level localization.
  • Figure 2: Illustration of four instruction types and their corresponding frame selection strategies in VidThinker. For semantic-focused instructions, the system selects diverse frames capturing key visual clues. For motion-focused instructions, frames are uniformly sampled to capture dynamic changes. When both semantic and motion cues are required, a hybrid sampling strategy is applied. For vague or open-ended instructions, the system samples a minimal yet diverse set of frames across the video for holistic coverage.
  • Figure 3: VideoITG model design: (a) Text generation aligns video and language tokens for sequential predictions. (b) Classification with causal attention utilizes anchor tokens for temporal cue management. (c) Classification with full attention facilitates interaction across visual and text tokens without anchors.
  • Figure 4: Two examples of how different sampling strategies impact video understanding. We mark the identified key frames that directly answer the question with green check-marks.
  • Figure 5: Example-1 shows how different sampling strategies impact video understanding. We mark the identified key frames that directly answer the question with green check-marks.
  • ...and 1 more figure
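
The four instruction types in the Figure 2 caption map naturally onto a dispatch over sampling strategies, sketched below. This is a minimal illustration under stated assumptions, not the paper's procedure. Each frame is represented by a single scalar feature (real systems would use embedding vectors), "diverse" is approximated by a greedy filter with a hypothetical `min_gap` threshold, and the type labels (`"semantic"`, `"motion"`, `"semantic+motion"`, `"open-ended"`) are names chosen here for illustration.

```python
def uniform(frames, k):
    """Pick up to k evenly spaced entries from a list of frame indices."""
    if k >= len(frames):
        return list(frames)
    step = len(frames) / k
    return [frames[int(i * step)] for i in range(k)]


def diverse(frames, feats, min_gap):
    """Greedy filter: keep a frame only if its (toy, scalar) feature
    differs from every already-kept frame by at least min_gap."""
    kept = []
    for f in frames:
        if all(abs(feats[f] - feats[g]) >= min_gap for g in kept):
            kept.append(f)
    return kept


def sample(kind, relevant, all_frames, feats, k=4, min_gap=0.5):
    """Dispatch on instruction type, following the Figure 2 caption."""
    if kind == "semantic":         # diverse frames capturing key visual clues
        return diverse(relevant, feats, min_gap)
    if kind == "motion":           # uniform sampling to capture dynamics
        return uniform(relevant, k)
    if kind == "semantic+motion":  # hybrid: union of both strategies
        both = set(diverse(relevant, feats, min_gap)) | set(uniform(relevant, k))
        return sorted(both)
    if kind == "open-ended":       # small, diverse set across the whole video
        return diverse(uniform(all_frames, k), feats, min_gap)
    raise ValueError(f"unknown instruction type: {kind}")


feats = [0.0, 0.1, 0.2, 1.0, 1.1, 2.0, 2.1, 2.2]  # one toy feature per frame
all_frames = list(range(len(feats)))
relevant = [2, 3, 4, 5]                            # frames inside retrieved clips

for kind in ("semantic", "motion", "semantic+motion", "open-ended"):
    print(kind, sample(kind, relevant, all_frames, feats, k=2 if "motion" in kind else 4))
```

Note that only the open-ended case samples from the whole video; the other three operate on frames inside the clips that were retrieved as relevant to the instruction.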