Video-to-Task Learning via Motion-Guided Attention for Few-Shot Action Recognition

Hanyu Guo, Wanchuan Yu, Suzhou Que, Kaiwen Du, Yan Yan, Hanzi Wang

TL;DR

This paper proposes Dual Motion-Guided Attention Learning (DMGAL), a novel method for few-shot action recognition that learns spatio-temporal relationships from the video-specific level to the task-specific level. The method is validated under both the fully fine-tuning and adapter-tuning paradigms.

Abstract

In recent years, few-shot action recognition has achieved remarkable performance through spatio-temporal relation modeling. Although a wide range of spatial and temporal alignment modules have been proposed, they primarily address spatial or temporal misalignments at the video level, while the spatio-temporal relationships across different videos at the task level remain underexplored. Recent studies utilize class prototypes to learn task-specific features but overlook the spatio-temporal relationships across different videos at the task level, especially in the spatial dimension, where these relationships provide rich information. In this paper, we propose a novel Dual Motion-Guided Attention Learning method (called DMGAL) for few-shot action recognition, aiming to learn the spatio-temporal relationships from the video-specific to the task-specific level. To achieve this, we propose a carefully designed Motion-Guided Attention (MGA) method to identify and correlate motion-related region features from the video level to the task level. Specifically, the Self Motion-Guided Attention module (S-MGA) achieves spatio-temporal relation modeling at the video level by identifying and correlating motion-related region features between different frames within a video. The Cross Motion-Guided Attention module (C-MGA) identifies and correlates motion-related region features between frames of different videos within a specific task to achieve spatio-temporal relationships at the task level. This approach enables the model to construct class prototypes that fully incorporate spatio-temporal relationships from the video-specific level to the task-specific level. We validate the effectiveness of our DMGAL method by employing both fully fine-tuning and adapter-tuning paradigms. The models developed using these paradigms are termed DMGAL-FT and DMGAL-Adapter, respectively.
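
To make the video-level versus task-level distinction concrete, here is a minimal sketch of generic scaled dot-product attention between the patch features of two frames. It is not the authors' MGA: the abstract does not specify how motion guidance is computed, so none is modeled here, and there are no learned projections. The function name `cross_frame_attention`, the tensor shapes, and the random features are all assumptions made for illustration. The only point is that the same operation can relate frames within one video (as S-MGA does) or frames from different videos in one task (as C-MGA does).

```python
import numpy as np

def cross_frame_attention(query_patches, context_patches):
    """Generic scaled dot-product attention from one frame's patches to another's.

    query_patches:   (N, D) patch features of the frame being updated
    context_patches: (M, D) patch features of the frame being attended to
    Returns (updated, weights) with shapes (N, D) and (N, M).
    """
    d = query_patches.shape[-1]
    scores = query_patches @ context_patches.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over context patches
    return weights @ context_patches, weights

rng = np.random.default_rng(0)
# Hypothetical shapes: 8 frames, 7x7 = 49 patches per frame, 64-dim features.
video_a = rng.normal(size=(8, 49, 64))
video_b = rng.normal(size=(8, 49, 64))

# Video level: relate two frames of the same video.
self_out, self_w = cross_frame_attention(video_a[0], video_a[1])

# Task level: relate frames taken from two different videos in the same episode.
cross_out, cross_w = cross_frame_attention(video_a[0], video_b[0])
```

Each row of `self_w` or `cross_w` is a distribution over the patches of the attended frame, so the output for a query patch is a weighted mix of the patches it correlates with most strongly.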

Paper Structure

This paper contains 30 sections, 11 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Comparison with previous methods. (a) Traditional methods design spatial and temporal alignment modules only to improve performance at the video level. (b) Recent works have focused on utilizing class prototypes to learn task-specific features, overlooking the spatio-temporal relationships between different videos at the task level. (c) Our method conducts spatio-temporal relation modeling for video-to-task learning, aiming to identify and correlate motion-related region features from the video level to the task level.
  • Figure 2: The overview of DMGAL-FT and details of our proposed MGA. (a) Overview of the DMGAL-FT model designed for the fully fine-tuning paradigm. DMGAL-FT uses S-MGA and C-MGA as additional modules to enable the model to sequentially learn spatio-temporal relationships at the video level and the task level, thereby achieving video-to-task learning via motion-guided attention within the fully fine-tuning paradigm. (b) Self Motion-Guided Attention module (S-MGA). S-MGA focuses on learning spatio-temporal relationships within a video, identifying and correlating motion-related region features in a video-specific manner. (c) Cross Motion-Guided Attention module (C-MGA). C-MGA focuses on learning spatio-temporal relationships within a task, identifying and correlating motion-related region features in a task-specific manner.
  • Figure 3: The overview of DMGAL-Adapter. (a) Overview of the DMGAL-Adapter model designed for the adapter-tuning paradigm. DMGAL-Adapter selectively plugs the Smga-Adapter into all the early layers of a pre-trained model, reserving the final layer for the Cmga-Adapter. (b) Details of the Smga-Adapter and Cmga-Adapter, which are simplified versions of S-MGA and C-MGA, respectively.
  • Figure 4: Visualization of the cross-association ability of S-MGA on four examples using the UCF and SSv2-small datasets. For better visualization, we downsample the patches to a $3 \times 3$ grid, flattened to 9 patches per frame. The vertical axis represents the 9 patches of $F_i$, while the horizontal axis represents the 9 patches of $F_{i+1}$. Brighter colors indicate higher similarity.
  • Figure 5: Attention map visualization of S-MGA, showing how it identifies and correlates motion-related regions at the video level. The left side displays the attention maps for the UCF dataset, and the right side for the SSv2-small dataset.
  • ...and 4 more figures
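
The kind of $9 \times 9$ map described in the Figure 4 caption can be sketched as follows. The caption does not state how the patches are downsampled or which similarity measure is plotted, so average pooling and cosine similarity are assumptions here, as are the helper names `pool_to_grid` and `cosine_similarity_map` and the random stand-in features.

```python
import numpy as np

def pool_to_grid(patches, grid=3):
    """Average-pool an (H, W, D) patch feature map down to (grid * grid, D)."""
    h, w, _ = patches.shape
    rows = np.array_split(np.arange(h), grid)
    cols = np.array_split(np.arange(w), grid)
    # One mean feature per grid cell, flattened in row-major order.
    cells = [patches[np.ix_(r, c)].mean(axis=(0, 1)) for r in rows for c in cols]
    return np.stack(cells)

def cosine_similarity_map(a, b, eps=1e-8):
    """Pairwise cosine similarity between the rows of a (N, D) and b (M, D)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return a @ b.T

rng = np.random.default_rng(0)
# Hypothetical 14x14 patch maps with 64-dim features for frames F_i and F_{i+1}.
f_i = rng.normal(size=(14, 14, 64))
f_next = rng.normal(size=(14, 14, 64))

# Rows index the 9 patches of F_i, columns the 9 patches of F_{i+1}.
sim = cosine_similarity_map(pool_to_grid(f_i), pool_to_grid(f_next))
self_sim = cosine_similarity_map(pool_to_grid(f_i), pool_to_grid(f_i))
```

Rendering `sim` as a heatmap gives a plot with the same axes as the figure: a bright cell at row $p$, column $q$ means patch $p$ of $F_i$ is similar to patch $q$ of $F_{i+1}$.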