Hierarchical Compositional Representations for Few-shot Action Recognition
Changzhen Li, Jie Zhang, Shuzhe Wu, Xin Jin, Shiguang Shan
TL;DR
This work tackles few-shot action recognition by introducing hierarchical compositional representations (HCR), which decompose each action into $K$ sub-actions and, further, into $M$ spatially attentional sub-actions (SAS-actions), so that fine-grained patterns can transfer from base to novel classes. A Parts Attention Module (PAM) yields explicit SAS-actions focused on body parts and implicit SAS-actions that capture context, and is trained with a pose-prior constraint. Earth Mover’s Distance (EMD) then matches sub-action sequences to compute video similarity in a way that preserves intra-sub-action timing. The method is evaluated on HMDB51, UCF101, and Kinetics, where it achieves state-of-the-art or competitive results; ablations highlight the importance of PAM placement, EMD, sub-action granularity, and pretraining. Overall, the results indicate that fine-grained compositional representations combined with optimal-transport-based matching give robust few-shot action recognition and improved transfer across datasets.
Abstract
Action recognition has recently received growing attention because of its broad practical applications in intelligent surveillance and human-computer interaction. However, few-shot action recognition has not been well explored and remains challenging because of data scarcity. In this paper, we propose a novel hierarchical compositional representations (HCR) learning approach for few-shot action recognition. Specifically, we divide a complicated action into several sub-actions by carefully designed hierarchical clustering and further decompose the sub-actions into more fine-grained spatially attentional sub-actions (SAS-actions). Although base classes and novel classes differ substantially, they can share similar patterns at the level of sub-actions or SAS-actions. Furthermore, we adopt the Earth Mover's Distance from the transportation problem to measure the similarity between video samples in terms of their sub-action representations. It uses the cost of the optimal matching flows between sub-actions as the distance metric, which is favorable for comparing fine-grained patterns. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51, UCF101 and Kinetics datasets.
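To make the matching step concrete, the following is a minimal sketch of an Earth Mover's Distance between two videos, each represented by a small set of sub-action feature vectors. It is not the authors' implementation. It assumes uniform weights on the sub-actions, the same number of sub-actions in both videos, non-zero feature vectors, and cosine distance as the ground cost; the function names `cosine_distance` and `emd_uniform` are hypothetical. Under these assumptions the optimal transport plan can be taken to be a one-to-one matching, so for small $K$ it can be found by enumerating permutations.

```python
from itertools import permutations
from math import sqrt


def cosine_distance(u, v):
    """Ground cost between two non-zero feature vectors (assumed choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def emd_uniform(query, support):
    """EMD between two equal-size sets of sub-action features.

    Assumes every sub-action carries weight 1/K. With uniform weights and
    equal set sizes, an optimal flow exists that is a permutation, so we
    search over all one-to-one matchings (only practical for small K).
    Returns (distance, matching), where matching[i] is the index of the
    support sub-action matched to query sub-action i.
    """
    k = len(query)
    if k == 0 or k != len(support):
        raise ValueError("both videos need the same non-zero number of sub-actions")

    # Pairwise ground costs between query and support sub-actions.
    cost = [[cosine_distance(q, s) for s in support] for q in query]

    def total_cost(perm):
        return sum(cost[i][perm[i]] for i in range(k))

    best = min(permutations(range(k)), key=total_cost)
    # Each matched pair moves mass 1/K, so the distance is the mean pair cost.
    return total_cost(best) / k, best


if __name__ == "__main__":
    # Toy features: the support video contains the same two sub-actions
    # as the query, but in the opposite order.
    query = [[1.0, 0.0], [0.0, 1.0]]
    support = [[0.0, 1.0], [1.0, 0.0]]
    distance, matching = emd_uniform(query, support)
    print(distance, matching)
```

In this toy case the matching pairs each query sub-action with its identical counterpart in the support video, so the distance is zero even though the order differs. The permutation search grows factorially with $K$; for larger sets, or for non-uniform weights, a linear-programming or assignment solver would be the usual choice.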
