Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

Xin Gu, Yaojie Shen, Chenxi Luo, Tiejian Luo, Yan Huang, Yuewei Lin, Heng Fan, Libo Zhang

TL;DR

Spatio-temporal video grounding (STVG) requires locating the target described by a language query within untrimmed videos. Existing Transformer-based STVG methods rely on zero-initialized object queries, which can hinder learning in the presence of distractors and occlusion. The authors propose TA-STVG, a Target-Aware Transformer that generates target-aware object queries from the video-text pair through two cascaded modules: Text-Guided Temporal Sampling (TTS) and Attribute-aware Spatial Activation (ASA). Across HCSTVG-v1/v2 and VidSTG, TA-STVG achieves state-of-the-art performance, and it generalizes when integrated with other Transformer-based STVG backbones. Ablations confirm that the benefits of TTS and ASA are additive, and further analyses cover efficiency and limitations.

Abstract

Transformer has attracted increasing interest in STVG, owing to its end-to-end pipeline and promising results. Existing Transformer-based STVG approaches often leverage a set of object queries, which are initialized simply using zeros and then gradually learn target position information via iterative interactions with multimodal features, for spatial and temporal localization. Despite their simplicity, these zero object queries lack target-specific cues and therefore struggle to learn discriminative target information from interactions with multimodal features in complicated scenarios (e.g., with distractors or occlusion), resulting in degradation. To address this, we introduce a novel Target-Aware Transformer for STVG (TA-STVG), which seeks to adaptively generate object queries by exploring target-specific cues from the given video-text pair, for improving STVG. The key lies in two simple yet effective modules, text-guided temporal sampling (TTS) and attribute-aware spatial activation (ASA), working in a cascade. The former focuses on selecting target-relevant temporal cues from a video using holistic text information, while the latter aims at further exploiting the fine-grained visual attribute information of the object from the previously selected target-aware temporal cues, which is then used for object query initialization. Compared to existing methods leveraging zero-initialized queries, object queries in our TA-STVG, directly generated from a given video-text pair, naturally carry target-specific cues, making them adaptive and better able to interact with multimodal features for learning more discriminative information to improve STVG. In our experiments on three benchmarks, TA-STVG achieves state-of-the-art performance and significantly outperforms the baseline, validating its efficacy.
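
To make the contrast with zero-initialized queries concrete, the following is a minimal sketch of the general idea only, not the authors' implementation. It assumes per-frame and text features are plain vectors, substitutes cosine similarity for the learned text-guided relevance scoring, and omits the attribute-aware spatial activation stage entirely. The function names (`temporal_sampling`, `init_query`) and the `threshold` parameter are hypothetical, introduced here for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def temporal_sampling(frame_feats, text_feat, threshold=0.5):
    """Assumed stand-in for text-guided sampling: score every frame
    against the text feature and keep the frames scoring >= threshold."""
    scores = [cosine(f, text_feat) for f in frame_feats]
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return scores, kept


def init_query(frame_feats, scores, kept):
    """Build a query as the score-weighted mean of the kept frame features.
    Falls back to a zero vector (the zero-initialized case) if nothing is kept."""
    dim = len(frame_feats[0])
    total = sum(scores[i] for i in kept)
    if not kept or total <= 0:
        return [0.0] * dim
    return [sum(scores[i] * frame_feats[i][d] for i in kept) / total
            for d in range(dim)]


# Toy data (made up): three frames with 2-D features and one text feature.
frame_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feat = [1.0, 0.0]

scores, kept = temporal_sampling(frame_feats, text_feat)
query = init_query(frame_feats, scores, kept)
print(kept, query)
```

In this toy setup the second frame is unrelated to the text and is dropped, so the resulting query is built only from frames that resemble the description. A zero-initialized query would carry no such information before decoding begins.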

Paper Structure

This paper contains 28 sections, 11 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: Comparison between existing Transformer-based STVG methods applying zero-initialized queries for STVG in (a) and our proposed Target-Aware Transformer-based STVG generating queries with target-aware cues from video and text for STVG in (b). Best viewed in color for all figures.
  • Figure 2: Comparison of the zero-initialized queries and groundtruth-generated queries for STVG. We see the target-specific information in groundtruth largely enhances results.
  • Figure 3: Overview of TA-STVG, which exploits target-specific information from the video and text (i.e., features from multimodal encoder) for generating spatial and temporal object queries for STVG.
  • Figure 4: Illustration of the architecture for TTS in (a) and ASA in (b).
  • Figure 5: Illustration of temporal relevance score $s$ by TTS in (a) and attribute-aware spatial activation (in selected frames), including appearance and motion activation, in (b) and (c). We can see from (a) that TTS can accurately select target-relevant frames, and from (b) and (c) that ASA precisely localizes attributes, e.g., color "yellow" and action "walks in, stops", related to the target.
  • ...and 9 more figures