VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT

Yifang Xu, Yunzhuo Sun, Zien Xie, Benxiang Zhai, Sidan Du

TL;DR

VTG-GPT is a GPT-based method for zero-shot video temporal grounding (VTG) that requires no training or fine-tuning. It significantly outperforms SOTA zero-shot methods, surpasses unsupervised approaches, and is competitive with supervised methods.

Abstract

Video temporal grounding (VTG) aims to locate specific temporal segments in an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce bias in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise a proposal generator and a post-processing step to produce accurate segments from the debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves performance competitive with supervised methods. The code is available at https://github.com/YoucanBaby/VTG-GPT
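
As a rough illustration of the final stage described in the abstract, the sketch below turns per-frame similarity scores (between a debiased query and the frame captions) into candidate segments and then prunes overlapping ones. It is not the authors' implementation: the fixed threshold, the use of temporal non-maximum suppression as the post-processing step, and the names generate_proposals, temporal_iou, and nms are all assumptions made for illustration, and the similarity scores are taken as given rather than computed.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec, score). Hypothetical representation.
Segment = Tuple[float, float, float]


def generate_proposals(scores: List[float], threshold: float,
                       fps: float = 1.0) -> List[Segment]:
    """Group consecutive frames with score >= threshold into segments.

    Assumption: `scores[i]` is the query-caption similarity of frame i,
    and frames are sampled at `fps` frames per second.
    """
    proposals: List[Segment] = []
    start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, s in enumerate(scores + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            run = scores[start:i]
            proposals.append((start / fps, i / fps, sum(run) / len(run)))
            start = None
    return proposals


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms(proposals: List[Segment], iou_threshold: float = 0.5) -> List[Segment]:
    """Keep the highest-scoring segments, dropping heavily overlapping ones."""
    kept: List[Segment] = []
    for p in sorted(proposals, key=lambda seg: seg[2], reverse=True):
        if all(temporal_iou(p, k) < iou_threshold for k in kept):
            kept.append(p)
    return kept


if __name__ == "__main__":
    frame_scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]  # made-up example values
    print(nms(generate_proposals(frame_scores, threshold=0.5)))
```

A fixed threshold is the simplest choice here; any scheme that adapts the threshold to each video's score distribution could be substituted without changing the rest of the sketch.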

Paper Structure (16 sections, 4 equations, 7 figures, 6 tables)

Figures (7)

  • Figure S1: (a) An illustrative example of a video temporal grounding (VTG) task. (b) Previous methods require training for all modules. (c) Our proposed VTG-GPT operates without any training or fine-tuning. Moreover, it employs GPT to reduce bias in human-annotated queries.
  • Figure S2: Human biases in ground-truth queries arise from (a) misspelled words and (b) incorrect descriptions. Our approach effectively mitigates these biases by leveraging GPT to optimize raw queries.
  • Figure S3: An overview of our proposed VTG-GPT. The framework contains four key phases: query debiasing, image captioning, proposal generation, and post-processing.
  • Figure S4: (a) An example of query refinement using Baichuan2. (b) An example of image captioning using MiniGPT-v2. The red font employed here is for demonstration purposes only and is not present in the actual code.
  • Figure S5: Visualization of predictions on the QVHighlights val split: (a) misspelled words; (b) incorrect descriptions. Our VTG-GPT achieves more precise localization than Moment-DETR, as it can correct errors in the original queries through rewriting and generate debiased queries, thereby facilitating more accurate grounding.
  • ...and 2 more figures