
A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

Allen He, Qi Liu, Kun Liu, Xinchen Liu, Wu Liu

Abstract

Temporal sentence grounding in videos (TSGV) aims to localize the temporal segment of an untrimmed video that semantically corresponds to a sentence query. Most current methods adopt pre-trained, query-agnostic visual encoders for offline feature extraction, so the video backbone is frozen and never optimized for TSGV. This creates a task discrepancy: the backbone is trained for visual classification but used for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and the localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to adaptively train a small portion of the video backbone's parameters. SCADA enables deeper backbones to be fine-tuned with reduced memory and significantly enhances visual representations by modulating feature maps with precisely integrated linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.
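To make the adapter idea concrete, the following PyTorch snippet is a minimal sketch of one plausible form of a sentence-conditioned adapter: a residual bottleneck whose hidden activations are scaled and shifted by a pooled sentence embedding (FiLM-style modulation). All module names, dimensions, and the modulation scheme here are our illustrative assumptions; the paper's actual SCADA design (see Figure 3 below) uses inner and outer branches and may differ.

```python
import torch
import torch.nn as nn


class SentenceConditionedAdapter(nn.Module):
    """Illustrative residual bottleneck adapter modulated by a sentence embedding."""

    def __init__(self, feat_dim: int, sent_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(feat_dim, bottleneck)      # compress visual tokens
        self.up = nn.Linear(bottleneck, feat_dim)        # expand back to feature dim
        self.film = nn.Linear(sent_dim, 2 * bottleneck)  # sentence -> (scale, shift)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapter starts as an identity mapping
        # and fine-tuning perturbs the frozen backbone gradually.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # x: (B, T, feat_dim) video tokens; sent: (B, sent_dim) pooled sentence feature
        h = self.act(self.down(x))
        gamma, beta = self.film(sent).chunk(2, dim=-1)
        h = h * (1.0 + gamma.unsqueeze(1)) + beta.unsqueeze(1)  # FiLM-style modulation
        return x + self.up(h)                                   # residual connection


# Example: adapt (B, T, D) backbone features with a (B, S) sentence embedding.
adapter = SentenceConditionedAdapter(feat_dim=768, sent_dim=512)
video_tokens = torch.randn(2, 16, 768)
sentence = torch.randn(2, 512)
out = adapter(video_tokens, sentence)  # same shape as video_tokens
```

Since only the adapter parameters receive gradients while the backbone stays frozen, optimizer state and activation memory scale with the small bottleneck rather than the full backbone, consistent with the abstract's claim that SCADA allows deeper backbones with reduced memory.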

Figures (5)

  • Figure 1: An overview of the classic pipeline for TSGV (top) vs. our fully end-to-end pipeline (bottom).
  • Figure 2: The proposed overall framework for fully end-to-end TSGV training. (a) In the Sentence Conditioned Adapter, sentence embeddings guide backbone fine-tuning to enhance visual feature extraction. Efficiency is improved via (b) a simplified detector head and (c) video-centric learning.
  • Figure 3: The architecture of the Sentence Conditioned Adapter. It consists of inner and outer branches that leverage sentence features to guide the fine-tuning of the visual backbone, ultimately enhancing the quality of the visual features.
  • Figure 4: The effect of image and temporal resolution.
  • Figure 5: Comparison of training time and test performance as the number of queries trained concurrently on a single video varies. The training time for a single query is taken as the baseline and normalized to 1. We report Rank1@IoU0.7 on both datasets using C3D as the backbone (see the sketch below).
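Figure 5's setup suggests a video-centric training loop in which the expensive backbone forward pass is run once per video and shared across all queries annotated on it. The sketch below is our assumption of how such a step might look with a query-agnostic backbone (as in the C3D study); the head, loss, and tensor shapes are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def video_centric_step(backbone, head, optimizer, video, queries, targets):
    """One illustrative training step sharing a single video forward across Q queries.

    video:   (1, C, T, H, W) clip tensor
    queries: (Q, sent_dim) sentence embeddings for the same video
    targets: (Q, 2) normalized (start, end) moments
    """
    feats = backbone(video)                        # (1, T', D): run the costly backbone once
    feats = feats.expand(queries.size(0), -1, -1)  # reuse the same features for all Q queries
    pred = head(feats, queries)                    # (Q, 2): per-query localization head
    loss = F.l1_loss(pred, targets)                # placeholder regression loss
    optimizer.zero_grad()
    loss.backward()                                # gradients from all queries sum into one video
    optimizer.step()
    return loss.item()
```

Amortizing one backbone forward over Q queries is what would let training time grow sublinearly in Q, which is the trade-off Figure 5 measures.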