SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Wenbo Huang, Jinghui Zhang, Xuwei Qian, Zhen Wu, Meng Wang, Lei Zhang

TL;DR

This work tackles the challenge of few-shot action recognition in high frame-rate videos by addressing subtle spatio-temporal relations and motion information density. It introduces SOAP-Net, a plug-and-play architecture with three parallel enhancement modules—3-Dimension Enhancement Module (3DEM), Channel-Wise Enhancement Module (CWEM), and Hybrid Motion Enhancement Module (HMEM)—together enabling joint spatio-temporal relation construction and comprehensive motion capture through frame tuples. Prototypes are built using triple priors to improve support-query representations, and a robust training objective ensures effective episodic learning. Empirical results on major FSAR benchmarks demonstrate state-of-the-art performance, strong generalization across backbones and modalities, and notable robustness to frame-rate variations and noise, highlighting SOAP's practical impact for real-world video understanding tasks.

Abstract

High frame-rate (HFR) videos improve the fine-grained expression of actions for recognition but reduce the density of spatio-temporal relations and motion information. Consequently, traditional data-driven training requires large numbers of video samples. However, samples are not always sufficient in real-world scenarios, which motivates few-shot action recognition (FSAR) research. We observe that most recent FSAR works build the spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, which separates spatial and temporal features within samples. They also capture motion information from a narrow perspective between adjacent frames without considering density, leading to insufficient motion information capture. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP). The model built on this architecture is referred to as SOAP-Net. Instead of simple feature extraction, it considers temporal connections between different feature channels and the spatio-temporal relation of features. It also captures comprehensive motion information using frame tuples, since tuples of multiple frames contain more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.
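The frame-tuple idea in the abstract can be illustrated with a small sketch. This is not the SOAP-Net implementation: it assumes that a frame tuple is a sliding window of consecutive frames, that each tuple is summarized by an element-wise average, and that motion is approximated by the mean absolute difference between neighbouring tuples. The function names (`frame_tuples`, `tuple_mean`, `tuple_motion`, `hybrid_motion`) and the default tuple sizes `(1, 2, 3)` are hypothetical choices made for illustration only.

```python
from typing import Dict, List, Sequence

Frame = List[float]  # a flattened frame or per-frame feature vector (assumption)


def frame_tuples(frames: Sequence[Frame], size: int) -> List[List[Frame]]:
    """Slide a window of `size` consecutive frames over the clip."""
    if size < 1 or size > len(frames):
        raise ValueError("tuple size must be between 1 and the number of frames")
    return [list(frames[i:i + size]) for i in range(len(frames) - size + 1)]


def tuple_mean(tup: Sequence[Frame]) -> Frame:
    """Average the frames of one tuple element-wise."""
    return [sum(values) / len(tup) for values in zip(*tup)]


def tuple_motion(frames: Sequence[Frame], size: int) -> List[float]:
    """Mean absolute difference between the averages of neighbouring tuples."""
    means = [tuple_mean(t) for t in frame_tuples(frames, size)]
    return [
        sum(abs(b - a) for a, b in zip(prev, nxt)) / len(prev)
        for prev, nxt in zip(means, means[1:])
    ]


def hybrid_motion(frames: Sequence[Frame],
                  sizes: Sequence[int] = (1, 2, 3)) -> Dict[int, List[float]]:
    """Collect motion estimates for several tuple sizes (illustrative only)."""
    return {size: tuple_motion(frames, size) for size in sizes}


if __name__ == "__main__":
    # Four toy "frames" with two values each.
    clip = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    for size, motion in hybrid_motion(clip).items():
        print(size, motion)
```

With size 1 the sketch reduces to plain adjacent-frame differences; larger sizes compare overlapping groups of frames, which is one simple way to read "a broader perspective" from tuples of diverse frame counts.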

Paper Structure (44 sections, 19 equations, 12 figures, 13 tables, 3 algorithms)

Figures (12)

  • Figure 1: Spatio-temporal relation and motion information density of HFR video frames are much subtler, as reflected by the timeline and displacement. Therefore, larger numbers of samples are required for data-driven training.
  • Figure 2: Overview of SOAP-Net. It comprises three main modules: the 3DEM for constructing the relation between spatial and temporal information, the CWEM for modeling temporal connections between channels, and the HMEM for capturing comprehensive motion information with frame tuples of varying frame counts using a hybrid approach. The “$\bar{\textbf{A}}$” symbol in the right part of the figure denotes the averaging calculation used to construct a query-specific prototype $P^{c}$ in Eqn. \ref{eq18}.
  • Figure 3: The structure of the 3-Dimension Enhancement Module (3DEM). The “$\bar{\textbf{A}}$” in the left part denotes the averaging calculation in Eqn. \ref{eq3}.
  • Figure 4: The structure of Channel-Wise Enhancement Module (CWEM).
  • Figure 5: The structure of the Hybrid Motion Enhancement Module (HMEM), where $\mathcal{O}=\left\{ 1,2,3 \right\}$ and “Concat” denotes the concatenation operation.
  • ...and 7 more figures