Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Andong Deng, Zhongpai Gao, Anwesa Choudhuri, Benjamin Planche, Meng Zheng, Bin Wang, Terrence Chen, Chen Chen, Ziyan Wu

TL;DR

This paper proposes Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos, and introduces a novel time representation that unifies positional information across image sequences, clip sequences, and long videos.

Abstract

Temporal awareness is essential for video large language models (LLMs) to understand and reason about events within long videos, enabling applications like dense video captioning and temporal video grounding in a unified system. However, the scarcity of long videos with detailed captions and precise temporal annotations limits their temporal awareness. In this paper, we propose Seq2Time, a data-oriented training paradigm that leverages sequences of images and short video clips to enhance temporal awareness in long videos. By converting sequence positions into temporal annotations, we transform large-scale image and clip captioning datasets into sequences that mimic the temporal structure of long videos, enabling self-supervised training with abundant time-sensitive data. To enable sequence-to-time knowledge transfer, we introduce a novel time representation that unifies positional information across image sequences, clip sequences, and long videos. Experiments demonstrate the effectiveness of our method, achieving a 27.6% improvement in F1 score and 44.8% in CIDEr on the YouCook2 benchmark and a 14.7% increase in recall on the Charades-STA benchmark compared to the baseline.

Paper Structure

This paper contains 22 sections, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Illustration of our proposed Seq2Time. While existing approaches [ren2024timechat, guo2024vtgllm] rely on timestamp-annotated long videos, Seq2Time leverages image and short video datasets. We introduce a novel time representation, the unified relative position token, which bridges different sequence types by encoding positions as 4-digit codes. Each digit becomes a learnable embedding in the language space, enabling seamless knowledge transfer in the LLM embedding space.
  • Figure 2: Image sequence data in Seq2Time, featuring three complementary pretext tasks designed to leverage index-caption correspondence. IIG (Image Index Grounding) mimics temporal grounding through position prediction, IIC (Indexed Image Captioning) parallels dense video captioning, and ALR (Adjacent Location Reasoning) enhances sequential understanding through neighbor relationships.
  • Figure 3: Clip sequence data in Seq2Time. We use LongVA to generate captions for short video clips from Kinetics-700, then combine multiple clips from different action categories to simulate longer videos. Temporal annotations are derived from sequence positions of clips. The resulting sequences serve as training data for both temporal grounding and dense video captioning tasks.
  • Figure 4: Qualitative examples of our Seq2Time on TimeChat. IS: image sequence data, CS: clip sequence data, MC: more video captions. Text highlighted in red shows repetitive patterns in outputs, while brown indicates incorrect predictions in timestamps or event descriptions.
  • Figure 5: Illustration of an example from LLaVA-ReCap-CC3M. The caption provides details on every aspect of the image, from foreground to background. For instance, even the motorcycle in the background is captured. The important descriptions are indicated in green text.
  • ...and 13 more figures
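
As a rough illustration of the idea in the abstract and the Figure 1 caption (sequence positions converted into temporal annotations, written as 4-digit codes), the following minimal sketch maps both a sequence index and a video timestamp onto the same code space. The summary above does not specify the exact mapping, so the linear scaling to the range 0000 to 9999, the floor rounding, and the helper names `position_to_code`, `time_to_code`, and `annotate_sequence` are assumptions made here for illustration, not the authors' implementation.

```python
from typing import List, Tuple

NUM_DIGITS = 4  # the Figure 1 caption mentions 4-digit codes


def position_to_code(index: int, length: int, num_digits: int = NUM_DIGITS) -> str:
    """Map a 0-based index in a sequence of `length` items to a zero-padded code.

    Assumption: the relative position index / (length - 1) is scaled
    linearly onto [0, 10**num_digits - 1] and floored.
    """
    if length <= 0 or not 0 <= index < length:
        raise ValueError("index must satisfy 0 <= index < length")
    max_code = 10 ** num_digits - 1
    value = 0 if length == 1 else (index * max_code) // (length - 1)
    return f"{value:0{num_digits}d}"


def time_to_code(t: float, duration: float, num_digits: int = NUM_DIGITS) -> str:
    """Map a timestamp in seconds to the same code space (same assumed scaling)."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    max_code = 10 ** num_digits - 1
    rel = min(max(t / duration, 0.0), 1.0)  # clamp to [0, 1]
    return f"{int(rel * max_code):0{num_digits}d}"


def annotate_sequence(captions: List[str]) -> List[Tuple[str, str]]:
    """Pair each caption in an image or clip sequence with its position code."""
    n = len(captions)
    return [(position_to_code(i, n), cap) for i, cap in enumerate(captions)]


if __name__ == "__main__":
    for code, caption in annotate_sequence(["a dog runs", "a man cooks", "a car turns"]):
        print(code, caption)
    print(time_to_code(30.0, 60.0))
```

Under these assumptions, the middle item of a sequence and the midpoint of a long video receive nearly the same code, which is the sense in which a shared representation could let position supervision from image and clip sequences carry over to timestamps.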