Hierarchical Compositional Representations for Few-shot Action Recognition

Changzhen Li; Jie Zhang; Shuzhe Wu; Xin Jin; Shiguang Shan

TL;DR

This work tackles few-shot action recognition by introducing hierarchical compositional representations (HCR) that decompose actions into $K$ sub-actions and $M$ SAS-actions, enabling transfer of fine-grained patterns from base to novel classes. A Parts Attention Module (PAM) yields explicit body-part–focused SAS-actions and implicit contextual SAS-actions, trained with a pose-prior constraint, while Earth Mover’s Distance (EMD) matches sub-action sequences to compute video similarity in a way that preserves intra-sub-action timing. The method is evaluated on HMDB51, UCF101, and Kinetics, achieving state-of-the-art or competitive results, with ablations highlighting the importance of PAM placement, EMD, sub-action granularity, and pretraining. Overall, HCR demonstrates that fine-grained compositional representations combined with optimal transport-based matching offer robust few-shot action recognition and improved transfer across datasets.

Abstract

Action recognition has attracted growing attention for its practical applications in intelligent surveillance and human-computer interaction. However, few-shot action recognition remains underexplored and challenging because of data scarcity. In this paper, we propose a novel hierarchical compositional representations (HCR) learning approach for few-shot action recognition. Specifically, we divide a complicated action into several sub-actions by carefully designed hierarchical clustering and further decompose the sub-actions into more fine-grained spatially attentional sub-actions (SAS-actions). Although base classes and novel classes differ substantially, they can share similar patterns in sub-actions or SAS-actions. Furthermore, we adopt the Earth Mover's Distance from the transportation problem to measure the similarity between video samples in terms of sub-action representations. It computes the optimal matching flows between sub-actions as the distance metric, which is favorable for comparing fine-grained patterns. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51, UCF101 and Kinetics datasets.
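To make the matching step concrete, the following is a minimal sketch of an Earth Mover's Distance between two sequences of sub-action feature vectors. It rests on several assumptions that are not taken from the paper: both videos have the same number of sub-actions, every sub-action carries equal weight, and the ground cost is one minus cosine similarity. Under equal uniform weights an optimal flow is a one-to-one matching, so for small sequences it can be found by brute force over permutations. The names `cosine_cost` and `emd_uniform` are hypothetical.

```python
import itertools
import math


def cosine_cost(u, v):
    """Ground cost between two feature vectors: 1 - cosine similarity (assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0  # treat a zero vector as having no similarity to anything
    return 1.0 - dot / norm


def emd_uniform(support, query):
    """EMD between two equal-length sequences with uniform weights.

    With equal uniform weights, an optimal transport plan is a permutation,
    so this brute-force search is exact (but only practical for small k).
    Returns (distance, matching) where matching[i] is the query index
    assigned to support sub-action i.
    """
    k = len(support)
    if k == 0 or k != len(query):
        raise ValueError("sequences must be non-empty and of equal length")
    cost = [[cosine_cost(s, q) for q in query] for s in support]
    best_total, best_perm = min(
        (sum(cost[i][perm[i]] for i in range(k)), perm)
        for perm in itertools.permutations(range(k))
    )
    return best_total / k, best_perm


if __name__ == "__main__":
    support_video = [[1.0, 0.0], [0.0, 1.0]]
    query_video = [[0.0, 2.0], [3.0, 0.0]]
    print(emd_uniform(support_video, query_video))
```

In this toy example the two query sub-actions appear in the opposite order from the support sub-actions, and the matching pairs them up regardless of position. For unequal weights or sequence lengths, the general transportation problem would need a linear-programming solver instead of this permutation search.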

Paper Structure (20 sections, 8 equations, 5 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Action recognition
Few-shot learning
Compositional representation learning
Few-shot action recognition
Method
Pipeline
Hierarchical compositional representations
Distance metric
Earth Mover’s Distance
EMD for few-shot action recognition
Discussions
Implementation details
Experiments
...and 5 more sections

Figures (5)

Figure 1: Although there exist differences between base classes and novel classes, they can share basic patterns in common, e.g., sub-actions and SAS-actions.
Figure 2: The pipeline. The whole video is first clustered into flexible sub-actions. The Feature Encoder then extracts the corresponding spatio-temporal representations for each sub-action. In this process, we regard each channel's output of the Parts Attention Module (PAM) as a SAS-action, and these SAS-actions are further divided into explicit SAS-actions and implicit SAS-actions. The former attend to pre-defined human body parts through a parts prior constraint, while the latter attend to other action-relevant cues such as context. Finally, EMD is adopted to measure the distance between the sub-action representation sequences of support and query videos.
Figure 3: The Parts Attention Module (PAM) architecture. Through PAM, the SAS-actions attend to various regions of interest; in particular, explicit SAS-actions focus on pre-defined body parts through a parts prior constraint.
Figure 4: Accuracy comparisons for various numbers of sub-actions in the 5-way 1-shot setting on HMDB51 (left), UCF101 (middle) and Kinetics (right).
Figure 5: The visualization results. The PAM restricts SAS-actions to pay attention to specific regions of interest.
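
The first step of the pipeline in Figure 2, clustering a video into sub-actions, can be illustrated with a small sketch. The details of the paper's hierarchical clustering are not given in this summary, so what follows is only one plausible reading: bottom-up agglomerative clustering in which only temporally adjacent segments may merge, using squared Euclidean distance between segment mean features. The function name `cluster_sub_actions` and the helpers are hypothetical, and the per-frame features are toy values.

```python
def mean_vec(frames):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cluster_sub_actions(frame_feats, k):
    """Group frames into k temporally contiguous segments (assumed scheme).

    Starts with one segment per frame and repeatedly merges the pair of
    neighbouring segments whose mean features are closest, until k remain.
    Returns a list of segments, each a list of frame indices.
    """
    if not 1 <= k <= len(frame_feats):
        raise ValueError("k must be between 1 and the number of frames")
    segments = [[i] for i in range(len(frame_feats))]
    while len(segments) > k:
        means = [mean_vec([frame_feats[i] for i in seg]) for seg in segments]
        # Only adjacent segments are merge candidates, which keeps each
        # sub-action contiguous in time.
        j = min(
            range(len(segments) - 1),
            key=lambda i: sq_dist(means[i], means[i + 1]),
        )
        segments[j:j + 2] = [segments[j] + segments[j + 1]]
    return segments


if __name__ == "__main__":
    feats = [[0.0], [0.1], [5.0], [5.2], [9.0]]
    print(cluster_sub_actions(feats, 3))
```

Because segment lengths are driven by feature similarity rather than fixed in advance, the resulting sub-actions can have different durations, which is one way to read "flexible" in the caption.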
