Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang

TL;DR

Grounded-VideoLLM tackles fine-grained temporal grounding in videos with two components: a two-stream encoding that separately captures spatial appearance and motion, and discrete temporal tokens that represent timestamps within a unified LLM framework. A three-stage progressive training regime aligns the video encoders and temporal tokens with the language model, complemented by a grounded VideoQA dataset to bolster temporal reasoning. The approach yields strong results across temporal sentence grounding, dense video captioning, and grounded VideoQA, while remaining capable on broader video understanding benchmarks. This work advances precise moment-level reasoning in Video-LLMs and offers a practical, adaptable framework for fine-grained video understanding tasks.

Abstract

Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We identify that current Video-LLMs are limited in fine-grained video understanding because they lack effective temporal modeling and timestamp representation. In light of this, we sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge to represent timestamps. To optimize the training of Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance Grounded-VideoLLM's temporal reasoning capability, we also curate a grounded VideoQA dataset using an automatic annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.

Paper Structure (20 sections, 3 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Grounded-VideoLLM enables Temporal Referring/Localizing/Reasoning for MLLMs.
  • Figure 2: Overview of Grounded-VideoLLM. For temporal modeling, we employ a segment-wise encoding strategy, decomposing each segment into a spatial part and a temporal part and encoding each separately. For timestamp representation, we introduce additional special temporal tokens that share a unified embedding space with the LLM.
  • Figure 3: Examples of annotation pipeline and generated data for Grounded VideoQA.
  • Figure 4: Attention weights of the LLM when generating the temporal tokens and 3D-PCA of embeddings.
  • Figure 5: Visualization of temporal tokens with PCA.
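
The abstract and the Figure 2 caption describe discrete temporal tokens that represent timestamps in the LLM's vocabulary. One way to picture the idea is as a quantization of relative time into a fixed set of special tokens. The sketch below is illustrative only: the token format `<T{k}>`, the bin count `num_bins`, and all function names are assumptions made for this example, not details taken from the paper.

```python
# Illustrative sketch only: token format "<T{k}>", the default num_bins,
# and every function name here are hypothetical, not the paper's.

def timestamp_to_token(t: float, duration: float, num_bins: int = 100) -> str:
    """Map an absolute time (seconds) to a discrete relative-time token."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp the relative position to [0, 1] so out-of-range times stay valid.
    frac = min(max(t / duration, 0.0), 1.0)
    idx = round(frac * num_bins)  # one of num_bins + 1 possible indices
    return f"<T{idx}>"


def token_to_timestamp(token: str, duration: float, num_bins: int = 100) -> float:
    """Invert the mapping: recover an approximate time in seconds."""
    idx = int(token[2:-1])  # strip the leading "<T" and trailing ">"
    return idx / num_bins * duration


def span_to_tokens(start: float, end: float, duration: float,
                   num_bins: int = 100) -> tuple[str, str]:
    """Encode a (start, end) moment as a pair of temporal tokens."""
    return (timestamp_to_token(start, duration, num_bins),
            timestamp_to_token(end, duration, num_bins))


if __name__ == "__main__":
    # A moment from 30 s to 45 s in a 120 s video.
    print(span_to_tokens(30.0, 45.0, 120.0))
```

Under this assumed scheme the tokens encode position relative to the video length, so the same vocabulary covers videos of any duration; the price is a quantization error of at most half a bin when a token is decoded back to seconds.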