Joint Image-Instance Spatial-Temporal Attention for Few-shot Action Recognition
Zefeng Qian, Chongyang Zhang, Yifei Huang, Gang Wang, Jiangyong Ying
TL;DR
This work tackles Few-shot Action Recognition by explicitly extracting action-related foreground instances and integrating them with image features through a Joint Image-Instance Spatial-Temporal Attention framework (I2ST). The proposed Action-related Instance Perception module (IPM) uses a text-guided segmentation model (SEEM) to produce discriminative instance embeddings, which are fused with image features via spatial-temporal attention to form robust video prototypes. A Global-Local Prototype Matching strategy further improves test-time matching by combining global sequence information with local frame-level cues. Across five standard FSAR datasets, I2ST achieves state-of-the-art results, particularly in the 1-shot and 3-shot settings, and generalizes well in cross-dataset scenarios. Further analyses show the importance of instance-level cues and the effectiveness of the fusion mechanism.
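As a rough illustration of how a global-local matching score could be formed, the sketch below combines a sequence-level similarity with a frame-level one. It is a minimal sketch under stated assumptions, not the matching strategy defined in the paper: cosine similarity, mean pooling for the global term, best-match averaging for the local term, and the mixing weight `alpha` are all hypothetical choices, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(frames):
    """Average a list of frame vectors into one sequence-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def global_local_score(query, prototype, alpha=0.5):
    """Blend a global and a local similarity (alpha is an assumed weight).

    query, prototype: lists of per-frame feature vectors.
    """
    # Global term: compare the pooled sequences as a whole.
    global_sim = cosine(mean_pool(query), mean_pool(prototype))
    # Local term: for each query frame, take its best-matching prototype frame.
    local_sim = sum(
        max(cosine(q, p) for p in prototype) for q in query
    ) / len(query)
    return alpha * global_sim + (1 - alpha) * local_sim


def classify(query, prototypes, alpha=0.5):
    """Return the class whose prototype scores highest for the query."""
    return max(
        prototypes,
        key=lambda c: global_local_score(query, prototypes[c], alpha),
    )
```

For example, with toy prototypes `{"run": [[1.0, 0.0], [1.0, 0.1]], "jump": [[0.0, 1.0], [0.1, 1.0]]}`, the query `[[0.9, 0.1], [1.0, 0.0]]` is assigned to `"run"`. In a real system the frame vectors would come from the fused image-instance features rather than hand-written numbers.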
Abstract
Few-shot Action Recognition (FSAR) is a crucial challenge in computer vision: recognizing actions from a limited set of examples. Recent approaches mainly employ image-level features to construct temporal dependencies and generate prototypes for each action category. However, image-level features incorporate background noise and focus insufficiently on the real foreground (the action-related instances), which compromises recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial-temporal attention approach (I2ST) for Few-shot Action Recognition. The core concept of I2ST is to perceive the action-related instances and integrate them with image features via spatial-temporal attention. Specifically, I2ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial-temporal Attention. Given the basic representations from the feature extractor, the Action-related Instance Perception perceives action-related instances under the guidance of a text-guided segmentation model. Subsequently, the Joint Image-Instance Spatial-temporal Attention is used to construct the feature dependency between instances and images...
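The idea of attending jointly over image and instance features can be sketched as follows. This is a simplified illustration under stated assumptions, not the architecture of I2ST: it uses single-head scaled dot-product self-attention with identity projections (no learned weights), treats all image and instance tokens from all frames as one token set, and mean-pools the result. The names `self_attention` and `joint_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens
        ]
        weights = softmax(scores)
        # Each output token is a weighted average of all input tokens.
        out.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        )
    return out


def joint_attention(image_feats, instance_feats):
    """Fuse per-frame image and instance features into one video vector.

    image_feats: T frame vectors of size D.
    instance_feats: T lists, each holding K instance vectors of size D.
    """
    tokens = []
    for img, insts in zip(image_feats, instance_feats):
        tokens.append(img)    # one image-level token per frame
        tokens.extend(insts)  # plus that frame's instance tokens
    attended = self_attention(tokens)
    n = len(attended)
    # Mean-pool the attended tokens (an assumed pooling choice).
    return [sum(col) / n for col in zip(*attended)]
```

Because every token attends to tokens from every frame, the dependencies captured here are both spatial (image versus instance within a frame) and temporal (across frames). A practical model would add learned projections, multiple heads, and positional information.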
