Spatio-temporal Prompting Network for Robust Video Feature Extraction

Guanxiong Sun, Chi Wang, Zhaoyu Zhang, Jiankang Deng, Stefanos Zafeiriou, Yang Hua

TL;DR

The paper tackles frame quality deterioration in video understanding by proposing Spatio-Temporal Prompting Network (STPN), a lightweight, task-agnostic framework that injects spatio-temporal cues into the backbone via dynamic video prompts (DVPs) generated from nearby frames. STPN operates in two stages: predicting DVPs from support embeddings and prompting the current frame by prepending these prompts to patch embeddings before a shared transformer backbone, enabling robust feature extraction without task-specific modules. Two predictor designs, a transformer-based and a Mixer-based variant, generate $N_P$ prompts, and the approach generalizes across video object detection, video instance segmentation, and visual object tracking, achieving state-of-the-art results on ImageNet VID, YouTube-VIS, and GOT-10k. The method demonstrates strong speed-accuracy trade-offs and qualitative improvements (Grad-CAM and masking) in degraded video conditions, highlighting the practical impact of pre-backbone prompting for general video understanding.

Abstract

Frame quality deterioration is one of the main challenges in the field of video understanding. To compensate for the information loss caused by deteriorated frames, recent approaches exploit transformer-based integration modules to obtain spatio-temporal information. However, these integration modules are heavy and complex. Furthermore, each integration module is specifically tailored for its target task, making it difficult to generalise to multiple tasks. In this paper, we present a neat and unified framework, called Spatio-Temporal Prompting Network (STPN). It can efficiently extract robust and accurate video features by dynamically adjusting the input features in the backbone network. Specifically, STPN predicts several video prompts containing spatio-temporal information of neighbour frames. Then, these video prompts are prepended to the patch embeddings of the current frame as the updated input for video feature extraction. Moreover, STPN is easy to generalise to various video tasks because it does not contain task-specific modules. Without bells and whistles, STPN achieves state-of-the-art performance on three widely-used datasets for different video understanding tasks, i.e., ImageNetVID for video object detection, YouTubeVIS for video instance segmentation, and GOT-10k for visual object tracking. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
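The two steps described in the abstract (predict video prompts from neighbour frames, then prepend them to the current frame's patch embeddings) can be sketched at the level of tensor shapes. This is a minimal illustration rather than the authors' implementation: `predict_prompts` is a hypothetical stand-in that uses a single linear mix over support tokens in place of the paper's transformer-based or Mixer-based predictor, and all sizes (16-dimensional embeddings, 196 patches, 2 support frames, 5 prompts) are assumed values chosen for the example.

```python
import numpy as np


def predict_prompts(support_emb, token_mix):
    """Hypothetical placeholder for the DVP predictor.

    support_emb: (num_support_tokens, dim) embeddings of the support frames.
    token_mix:   (num_prompts, num_support_tokens) mixing weights (assumed).
    Returns (num_prompts, dim) prompts that summarise the support tokens.
    """
    return token_mix @ support_emb


def prompt_current_frame(prompts, patch_emb):
    """Prepend the prompts to the current frame's patch embeddings."""
    return np.concatenate([prompts, patch_emb], axis=0)


rng = np.random.default_rng(0)
dim, num_patches, num_support_frames, num_prompts = 16, 196, 2, 5  # assumed sizes

support_emb = rng.standard_normal((num_support_frames * num_patches, dim))
patch_emb = rng.standard_normal((num_patches, dim))
token_mix = rng.standard_normal((num_prompts, support_emb.shape[0]))

prompts = predict_prompts(support_emb, token_mix)
backbone_input = prompt_current_frame(prompts, patch_emb)

# The backbone now sees num_prompts + num_patches tokens for the current frame.
print(backbone_input.shape)  # (201, 16)
```

The point of the sketch is that the backbone itself is unchanged: only its input sequence grows by the number of prompts, which is why no task-specific integration module is needed after the backbone.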

Paper Structure (18 sections, 7 equations, 8 figures, 5 tables)

This paper contains 18 sections, 7 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Comparisons between pipelines of (a) existing methods and (b) the proposed Spatio-temporal Prompting Network (STPN). Existing methods introduce complex and task-specific integration modules after backbone networks. In contrast, STPN is a unified framework for multiple tasks. A lightweight dynamic video prompt (DVP) predictor generates a set of DVPs to adjust the input before backbone networks. Best viewed in colour.
  • Figure 2: (a) An overview of our approach. In the predicting stage, a set of dynamic video prompts (DVPs) $\mathbf{P}$ is generated by the DVP predictor, which takes the support embeddings $\mathbf{E}_{sup}$ of the support frames $I_{sup}$ as input. Then, in the prompting stage, the predicted DVPs are prepended to the patch embeddings of the current frame, and a transformer encoder with $L$ transformer layers extracts spatio-temporal embeddings. Finally, different task heads take the spatio-temporal embeddings and output final results for various general video tasks, e.g., video object detection, video instance segmentation, and visual object tracking. (b) Details of the transformer-based predictor. (c) Details of the Mixer-based predictor.
  • Figure 3: A comparison between (a) STPN shallow and (b) STPN deep. The green rounded rectangles denote predicted DVPs. In (a), one set of DVPs is prepended to the patch embeddings before the first layer of the transformer encoder. In (b), $L$ sets of DVPs are predicted and then prepended to the input embeddings of all $L$ transformer layers in the transformer encoder.
  • Figure 4: Comparisons between the transformer-based and the Mixer-based DVP predictors.
  • Figure 5: Comparisons of STPN deep and STPN shallow on transformer encoders with different scales.
  • ...and 3 more figures
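
The difference between STPN shallow and STPN deep in the Figure 3 caption can be sketched as two loops over the encoder layers. This is only an illustration under stated assumptions: `toy_layer` is a hypothetical shape-preserving placeholder for a transformer layer, the deep variant is assumed to discard each layer's prompt outputs before the next set is prepended (the caption does not say how they are handled), and returning only the patch tokens is likewise an assumption.

```python
import numpy as np


def toy_layer(tokens):
    """Hypothetical stand-in for one transformer layer (shape-preserving)."""
    return tokens + tokens.mean(axis=0, keepdims=True)


def encode_shallow(patch_emb, prompts, num_layers):
    """One set of DVPs, prepended once before the first layer."""
    tokens = np.concatenate([prompts, patch_emb], axis=0)
    for _ in range(num_layers):
        tokens = toy_layer(tokens)
    return tokens[len(prompts):]  # assumption: keep only the patch tokens


def encode_deep(patch_emb, prompt_sets):
    """One set of DVPs per layer, prepended to that layer's input."""
    tokens = patch_emb
    for layer_prompts in prompt_sets:
        tokens = toy_layer(np.concatenate([layer_prompts, tokens], axis=0))
        # Assumption: drop the prompt outputs so the next layer gets fresh DVPs.
        tokens = tokens[len(layer_prompts):]
    return tokens


rng = np.random.default_rng(0)
num_layers, num_prompts, num_patches, dim = 4, 5, 196, 16  # assumed sizes

patch_emb = rng.standard_normal((num_patches, dim))
shallow_prompts = rng.standard_normal((num_prompts, dim))
deep_prompt_sets = rng.standard_normal((num_layers, num_prompts, dim))

shallow_out = encode_shallow(patch_emb, shallow_prompts, num_layers)
deep_out = encode_deep(patch_emb, deep_prompt_sets)
print(shallow_out.shape, deep_out.shape)
```

In both variants the output keeps one embedding per patch, so downstream task heads are unaffected; the deep variant needs $L$ times as many predicted prompts.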