FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition
Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Mubarak Shah
TL;DR
This paper tackles semi-supervised fine-grained action recognition (FGAR) by using temporal alignability to capture action phases. It introduces FinePseudo, a co-training framework in which a frame-wise alignability encoder $f_A$ and a video action encoder $f_E$ refine pseudo-labels through collaborative predictions. The alignability model is trained with a differentiable softDTW distance $D(\mathbf{u}, \mathbf{v})$, a triplet loss $\mathcal{L}_{AT}$, and a learnable alignability score $S(\mathbf{u}, \mathbf{v})$, optimized with $\mathcal{L}_{AV} = \mathcal{L}_{AT} + \omega \mathcal{L}_{Score}$. A Gaussian Infused Temporal Distinctiveness Loss (GITDL) first pretrains $f_A$ on unlabeled data to encode intra-video action phases. Collaborative pseudo-labeling then combines $p_E$ (from $f_E$) and $p_A$ (from a non-parametric classifier $\phi_A$) to produce refined labels for self-training, subject to a confidence threshold $\theta$. The method achieves state-of-the-art or competitive results on fine-grained datasets (Diving48, FineGym99/288, FineDiving) and maintains strong performance on coarse-grained datasets (Kinetics400, Something-SomethingV2). It is also robust in open-world settings, where alignability scores are used to filter novel classes. Overall, FinePseudo shows that temporally aware alignability signals can substantially improve pseudo-label quality and FGAR performance under limited labeling, with practical impact for real-world video understanding tasks that require precise action-phase discrimination.
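The collaborative pseudo-labeling step summarized above can be illustrated with a minimal sketch. The summary does not state how $p_E$ and $p_A$ are combined, so the plain average used here is an assumption made only for illustration, and the function name `refine_pseudo_label` and the default value of `theta` are likewise hypothetical.

```python
def refine_pseudo_label(p_E, p_A, theta=0.8):
    """Fuse two class distributions and keep the label only if confident.

    p_E: class probabilities from the video action encoder f_E.
    p_A: class probabilities from the alignability-based classifier phi_A.
    theta: confidence threshold (default chosen arbitrarily for this sketch).

    ASSUMPTION: the two predictions are fused by simple averaging; the
    actual fusion rule of the method is not given in this summary.
    """
    if len(p_E) != len(p_A):
        raise ValueError("p_E and p_A must cover the same classes")

    # Element-wise average of the two distributions (assumed fusion rule).
    fused = [(e + a) / 2.0 for e, a in zip(p_E, p_A)]

    # Most likely class under the fused distribution.
    label = max(range(len(fused)), key=fused.__getitem__)
    confidence = fused[label]

    # Discard the pseudo-label when the fused confidence is below theta.
    if confidence < theta:
        return None, confidence
    return label, confidence
```

For example, `refine_pseudo_label([0.7, 0.2, 0.1], [0.9, 0.05, 0.05], theta=0.75)` keeps class 0 because both predictors agree with high confidence, while a pair of predictions that disagree is rejected and the video stays unlabeled for that round of self-training.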
Abstract
Real-life applications of action recognition often require a fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annotate, existing semi-supervised action recognition has mainly focused on coarse-grained actions. Fine-grained actions are more challenging because they lack scene bias, so classifying them requires an understanding of action phases; as a result, existing coarse-grained semi-supervised methods do not work effectively on them. In this work, we present the first thorough investigation of semi-supervised fine-grained action recognition (FGAR). We observe that alignment distances such as dynamic time warping (DTW) provide a suitable action-phase-aware measure for comparing fine-grained actions, a concept previously unexploited in FGAR. However, because the regular DTW distance is pairwise and assumes strict alignment between pairs, it is not directly suitable for classifying fine-grained actions. To utilize such alignment distances in a limited-label setting, we propose an Alignability-Verification-based Metric learning technique that effectively discriminates between fine-grained action pairs. Our learnable alignability score provides a better phase-aware measure, which we use to refine the pseudo-labels of the primary video encoder. Our collaborative pseudo-labeling-based framework, FinePseudo, significantly outperforms prior methods on four fine-grained action recognition datasets (Diving48, FineGym99, FineGym288, and FineDiving) and shows improvement on existing coarse-grained datasets (Kinetics400 and Something-SomethingV2). We also demonstrate the robustness of our collaborative pseudo-labeling in handling novel unlabeled classes in open-world semi-supervised setups. Project Page: https://daveishan.github.io/finepsuedo-webpage/.
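To make the notion of an alignment distance concrete, the following is a minimal sketch of DTW between two sequences of frame embeddings, together with its smoothed soft-DTW variant, which replaces the hard minimum with a soft minimum controlled by a temperature `gamma`. The squared Euclidean frame cost, the function names, and the toy inputs are assumptions for illustration; this is not the paper's implementation.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two frame embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_min(values, gamma):
    """Soft minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def alignment_distance(u, v, gamma=None):
    """DTW distance between sequences u and v (lists of embeddings).

    gamma=None gives classic DTW (hard minimum);
    gamma > 0 gives soft-DTW, a smooth lower bound on classic DTW.
    """
    n, m = len(u), len(v)
    inf = float("inf")
    # R[i][j] holds the cost of the best alignment of u[:i] with v[:j].
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = (R[i - 1][j], R[i][j - 1], R[i - 1][j - 1])
            best = min(prev) if gamma is None else soft_min(prev, gamma)
            R[i][j] = sq_dist(u[i - 1], v[j - 1]) + best
    return R[n][m]
```

With toy one-dimensional embeddings, a sequence and a time-stretched copy of it (one frame repeated) align at zero cost, whereas the same frames in reversed order do not. This is the phase-aware behavior that motivates using alignment distances for fine-grained actions.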
