AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding

Xing Zhang, Jiaxi Gu, Haoyu Zhao, Shicong Wang, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu, Yu-Gang Jiang

TL;DR

AutoTVG introduces a vision-language pre-training paradigm for Temporal Video Grounding that jointly learns semantic alignment and boundary regression from automatically annotated untrimmed videos. The core components are Captioned Moment Generation (CMG), which builds captioned moments from video subtitles using CLIP-based noun/verb selection, and TVGNet, a regression-enabled grounding network trained with a dual-loss objective $L_{total} = L_{reg} + \lambda L_{guide}$, where $L_{reg}$ is a Huber loss. Pre-training on a 30K subset of HowTo100M enables zero-shot evaluation on Charades-STA and ActivityNet Captions, achieving competitive results with far less data than prior methods and outperforming several weakly-supervised baselines, which is attributed to the diversity of nouns and verbs in subtitles. The work demonstrates that untrimmed-video pre-training with CMG narrows the gap between pre-training and downstream TVG tasks, offering a practical path toward zero-shot temporal grounding with reduced annotation cost.
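
As a rough illustration of the dual-loss objective above, the following minimal sketch computes a Huber regression term over predicted (start, end) boundaries and adds a weighted guidance term. Several things here are assumptions and not taken from the paper: timestamps are normalized to [0, 1], the Huber threshold `delta` defaults to 1.0, and $L_{guide}$ is passed in as a precomputed number because its definition is not given in this summary. The function names are hypothetical.

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Huber penalty: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def regression_loss(pred, target, delta: float = 1.0) -> float:
    """Mean Huber loss over the (start, end) boundaries of one moment.

    Assumption: pred and target are equal-length sequences of normalized
    timestamps, e.g. (start, end) with values in [0, 1].
    """
    return sum(huber(p - t, delta) for p, t in zip(pred, target)) / len(pred)


def total_loss(pred, target, guide_loss: float, lam: float = 1.0,
               delta: float = 1.0) -> float:
    """L_total = L_reg + lambda * L_guide.

    guide_loss is a placeholder for L_guide, supplied by the caller;
    its actual form is not specified in this summary.
    """
    return regression_loss(pred, target, delta) + lam * guide_loss


if __name__ == "__main__":
    # Predicted moment (0.2, 0.9) against an annotated moment (0.2, 0.4).
    print(total_loss((0.2, 0.9), (0.2, 0.4), guide_loss=0.5, lam=0.5))
```

The Huber form keeps gradients bounded when a predicted boundary is far from its target, which is one plausible reason to prefer it over a plain squared error when the annotations are generated automatically and may be noisy.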

Abstract

Temporal Video Grounding (TVG) aims to localize a moment in an untrimmed video given a language description. Since annotating TVG data is labor-intensive, TVG under limited supervision has attracted attention in recent years. The great success of vision-language pre-training has led TVG to follow the traditional "pre-training + fine-tuning" paradigm. However, the pre-training process suffers from a lack of temporal modeling and fine-grained alignment because the nature of the data differs between pre-training and testing. Moreover, the large gap between pretext and downstream tasks makes zero-shot testing impossible for the pre-trained model. To avoid these drawbacks, we propose AutoTVG, a new vision-language pre-training paradigm for TVG that enables the model to learn semantic alignment and boundary regression from automatically annotated untrimmed videos. Specifically, AutoTVG consists of a novel Captioned Moment Generation (CMG) module that generates captioned moments from untrimmed videos, and TVGNet, which uses a regression head to predict localization results. Experimental results on Charades-STA and ActivityNet Captions show that, for zero-shot temporal video grounding, AutoTVG under out-of-distribution testing achieves performance highly competitive with in-distribution methods, and it is superior to existing pre-training frameworks while using much less training data.

Paper Structure (20 sections, 1 equation, 7 figures, 13 tables)

Figures (7)

  • Figure 1: Comparison of the traditional "pre-training + fine-tuning" paradigm and the proposed AutoTVG. The traditional method follows a two-step strategy that pre-trains vision and text encoders with a self-supervised loss and then fine-tunes a TVG model, while the proposed AutoTVG pre-trains the encoders and a TVG model in a single step on untrimmed videos, so that it can perform zero-shot testing.
  • Figure 2: An overview of our proposed method AutoTVG, which consists of two main modules, CMG and TVGNet. The CMG module generates captioned moments from untrimmed videos by exploiting the speech in the videos; the generated captioned moments are then used to pre-train TVGNet.
  • Figure 3: The pipeline of CMG. Video moments are generated by clustering video frame features, and their corresponding captions are decided by cross-modal alignment. To reduce the impact of noise from raw captions, only nouns and verbs are selected to caption the resulting moments.
  • Figure 4: The structure of TVGNet. Two encoders extract features from the input video and the query text respectively, a contextual encoding module models global context-aware features, and cross-modal features are integrated through a cross-attention module. Finally, a timestamp is predicted by a regression head.
  • Figure 5: Visualization of the generated captioned moments and raw video captions.
  • ...and 2 more figures
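
The CMG pipeline described in the Figure 3 caption (cluster frame features into moments, then caption each moment by cross-modal alignment) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the greedy temporal clustering rule, the cosine threshold of 0.8, and the function names are all hypothetical, and the hand-written 2-D vectors stand in for CLIP frame and word embeddings. The candidate words are assumed to have already been filtered down to nouns and verbs from the subtitles.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_frames(frame_feats, threshold=0.8):
    """Greedy temporal clustering (an assumed stand-in for the paper's method).

    A new moment starts whenever a frame's similarity to the mean of the
    current moment drops below the threshold. Returns inclusive
    (start_index, end_index) pairs.
    """
    moments = []
    start = 0
    for i in range(1, len(frame_feats)):
        centre = mean_vector(frame_feats[start:i])
        if cosine(frame_feats[i], centre) < threshold:
            moments.append((start, i - 1))
            start = i
    if frame_feats:
        moments.append((start, len(frame_feats) - 1))
    return moments


def caption_moments(frame_feats, word_feats, threshold=0.8):
    """Caption each moment with the best-aligned candidate word.

    word_feats maps candidate nouns/verbs to embeddings in the same
    space as the frame features (hypothetical toy vectors here).
    """
    results = []
    for start, end in cluster_frames(frame_feats, threshold):
        centre = mean_vector(frame_feats[start:end + 1])
        best = max(word_feats, key=lambda w: cosine(word_feats[w], centre))
        results.append((start, end, best))
    return results


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    words = {"cut": [1.0, 0.0], "stir": [0.0, 1.0]}
    print(caption_moments(frames, words))
```

With the toy inputs in the `__main__` block, the four frames split into two moments, and each moment is paired with the word whose vector points in the same direction. A real pipeline would replace the toy vectors with encoder outputs and would likely keep several words per moment instead of one.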