Joint Image-Instance Spatial-Temporal Attention for Few-shot Action Recognition

Zefeng Qian, Chongyang Zhang, Yifei Huang, Gang Wang, Jiangyong Ying

TL;DR

This work tackles Few-shot Action Recognition (FSAR) by explicitly extracting action-related foreground instances and integrating them with image features through a Joint Image-Instance Spatial-Temporal Attention framework (I2ST). The proposed Action-related Instance Perception module (IPM) relies on a text-guided segmentation model (SEEM) for guidance and produces discriminative instance embeddings, which are fused with image features via Spatial-Temporal Attention to form robust video prototypes. A Global-Local Prototype Matching strategy further improves test-time matching by combining global sequence information with local frame-level cues. Across five standard FSAR datasets, I2ST achieves state-of-the-art results, particularly in the 1-shot and 3-shot settings, and generalizes well in cross-dataset scenarios. The accompanying analyses highlight the importance of instance-level cues and the effectiveness of the fusion mechanism.
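To make the idea of combining global sequence information with local frame-level cues concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the cosine distance, the mean pooling used for the global term, the nearest-frame average used for the local term, and the weight `alpha` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv + 1e-8)

def mean_pool(frames):
    """Average T frame vectors into one sequence-level vector."""
    t = len(frames)
    return [sum(f[d] for f in frames) / t for d in range(len(frames[0]))]

def global_local_distance(query, prototype, alpha=0.5):
    """query, prototype: lists of per-frame feature vectors.

    Global term (assumed): distance between mean-pooled sequences.
    Local term (assumed): each query frame is matched to its closest
    prototype frame, and the distances are averaged.
    """
    global_d = cosine_distance(mean_pool(query), mean_pool(prototype))
    local_d = sum(
        min(cosine_distance(q, p) for p in prototype) for q in query
    ) / len(query)
    return alpha * global_d + (1.0 - alpha) * local_d

def classify(query, prototypes, alpha=0.5):
    """Return the class whose prototype is closest to the query video."""
    return min(
        prototypes,
        key=lambda c: global_local_distance(query, prototypes[c], alpha),
    )

# Toy 2-frame, 2-dimensional example with two hypothetical classes.
prototypes = {"a": [[1.0, 0.0], [1.0, 0.0]], "b": [[0.0, 1.0], [0.0, 1.0]]}
query = [[0.9, 0.1], [1.0, 0.2]]
predicted = classify(query, prototypes)
```

In this toy setup the query frames point mostly along the first axis, so the prototype of class "a" is the closer one under both the global and the local term.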

Abstract

Few-shot Action Recognition (FSAR) is a crucial challenge in computer vision: recognizing actions from a limited set of examples. Recent approaches mainly employ image-level features to construct temporal dependencies and generate prototypes for each action category. However, many of these methods rely mainly on image-level features, which carry background noise and attend insufficiently to the real foreground (action-related instances), thereby compromising recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial-temporal attention approach (I2ST) for Few-shot Action Recognition. The core concept of I2ST is to perceive the action-related instances and integrate them with image features via spatial-temporal attention. Specifically, I2ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial-temporal Attention. Given the basic representations from the feature extractor, Action-related Instance Perception perceives action-related instances under the guidance of a text-guided segmentation model. Subsequently, Joint Image-Instance Spatial-temporal Attention constructs the feature dependency between instances and images...
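As a rough illustration of what "constructing the feature dependency between instances and images" can look like, the sketch below flattens image tokens and instance tokens from all frames into one sequence and applies a single round of scaled dot-product self-attention. This is a simplified stand-in under stated assumptions, not the authors' module: it has no learned query, key, or value projections, uses a single head, and treats space and time jointly. All names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention without projections
    (assumption: the tokens themselves serve as queries, keys, and values)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return out

def joint_image_instance_attention(image_tokens, instance_tokens):
    """image_tokens[t], instance_tokens[t]: lists of D-dim vectors for frame t.

    Tokens from every frame are concatenated, so each token can attend to
    image and instance tokens across both space and time.
    """
    joint = [
        tok
        for t in range(len(image_tokens))
        for tok in image_tokens[t] + instance_tokens[t]
    ]
    return self_attention(joint)

# Toy input: 2 frames, 2 image tokens and 1 instance token per frame, D = 2.
image_tokens = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
instance_tokens = [[[2.0, 0.0]], [[0.0, 2.0]]]
fused = joint_image_instance_attention(image_tokens, instance_tokens)
```

Each output token is a weighted average of all input tokens, so instance-level (foreground) information can flow into image-level tokens and vice versa. A practical model would add learned projections, multiple heads, and positional information.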

Paper Structure

This paper contains 31 sections, 7 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Illustration of our motivation. (a) Visualization of two similar actions: "playing ice hockey" and "ice skating". (b) Framework of prior works (Image Level). (c) Our framework (Image-Instance Level). Compared to (b), our framework (c) explicitly perceives the action-related instances in the image and merges them via spatial-temporal attention, generating more discriminative prototypes.
  • Figure 2: Overview of the proposed I$^2$ST. Given the support and query videos, we first use a feature extractor to encode the frames of each video. We then feed these image features into the instance perception module to extract instance embeddings and recover the action-related instance mask (the dashed line indicates that this step is used only during training). Next, the image features and the instance embeddings are fed into a Spatial-temporal Attention Module, which merges foreground and background information of action videos across both the temporal and spatial dimensions. Finally, the query video is classified based on the global-local matching results of the obtained video prototypes. For clarity of illustration, the other videos involved in a few-shot task are omitted from the figure.
  • Figure 3: Architecture details of the proposed Instance Perception Module (IPM), which includes an Instance Perception Encoder and an Instance Perception Decoder.
  • Figure 4: Ablation study on the effect of changing the number of input video frames under the 5-way 1-shot SSv2-Full setting.
  • Figure 5: t-SNE visualization of the feature distribution of five action classes on the test set of SSv2-Full. Different colors represent videos from different categories.
  • ...and 2 more figures