FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition

Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Mubarak Shah

TL;DR

This paper tackles semi-supervised fine-grained action recognition (FGAR) by using temporal alignability to capture action phases. It introduces FinePseudo, a co-training framework in which a frame-wise alignability encoder $f_A$ and a video action encoder $f_E$ jointly refine pseudo-labels through collaborative predictions. The alignability model is trained with a differentiable softDTW-based distance $D(\mathbf{u}, \mathbf{v})$, a triplet loss $\mathcal{L}_{AT}$, and a learnable alignability score $S(\mathbf{u}, \mathbf{v})$, optimized jointly as $\mathcal{L}_{AV} = \mathcal{L}_{AT} + \omega \mathcal{L}_{score}$. A Gaussian Infused Temporal Distinctiveness Loss (GITDL) first pretrains $f_A$ on unlabeled data to encode intra-video action phases. Collaborative pseudo-labeling then combines $p_E$ (from $f_E$) and $p_A$ (from a non-parametric classifier $\phi_A$) into refined labels for self-training, subject to a confidence threshold $\theta$. The method achieves state-of-the-art or competitive results on fine-grained datasets (Diving48, FineGym99/288, FineDiving) and maintains strong performance on coarse-grained datasets (Kinetics400, Something-SomethingV2). It is also robust in open-world settings, where alignability scores are used to filter out novel classes. Overall, FinePseudo shows that temporally aware alignability signals can substantially improve pseudo-label quality and FGAR performance when labels are scarce, which matters for real-world video understanding tasks that require precise action-phase discrimination.
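The collaborative pseudo-labeling step summarized above can be illustrated with a minimal sketch. The summary does not state how $p_E$ and $p_A$ are fused, so the plain average used here is an assumption, as are the function name `collaborative_pseudo_label`, the default value of `theta`, and the `>=` comparison against the threshold.

```python
from typing import Optional, Sequence, Tuple


def collaborative_pseudo_label(
    p_e: Sequence[float],
    p_a: Sequence[float],
    theta: float = 0.8,
) -> Optional[Tuple[int, float]]:
    """Fuse two class-probability vectors and return (label, confidence).

    Returns None when the fused confidence is below the threshold theta,
    i.e. the unlabeled sample is skipped for self-training.
    """
    if len(p_e) != len(p_a):
        raise ValueError("p_e and p_a must cover the same classes")

    # ASSUMPTION: the two predictions are fused by a simple average.
    # The source only says they are "combined".
    combined = [(e + a) / 2.0 for e, a in zip(p_e, p_a)]

    # Pick the most likely class under the fused prediction.
    label = max(range(len(combined)), key=combined.__getitem__)
    confidence = combined[label]

    # Keep the pseudo-label only if it is confident enough.
    if confidence < theta:
        return None
    return label, confidence
```

When both encoders agree on a class with high probability, the sample receives a pseudo-label; when they disagree, the fused confidence drops and the sample is left unlabeled.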

Abstract

Real-life applications of action recognition often require a fine-grained understanding of subtle movements, e.g., in sports analytics, user interactions in AR/VR, and surgical videos. Although fine-grained actions are more costly to annotate, existing semi-supervised action recognition has mainly focused on coarse-grained action recognition. Since fine-grained actions are more challenging due to the absence of scene bias, classifying these actions requires an understanding of action phases. Hence, existing coarse-grained semi-supervised methods do not work effectively. In this work, we thoroughly investigate semi-supervised fine-grained action recognition (FGAR) for the first time. We observe that alignment distances like dynamic time warping (DTW) provide a suitable action-phase-aware measure for comparing fine-grained actions, a concept previously unexploited in FGAR. However, since the regular DTW distance is pairwise and assumes strict alignment between pairs, it is not directly suitable for classifying fine-grained actions. To utilize such alignment distances in a limited-label setting, we propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs. Our learnable alignability score provides a better phase-aware measure, which we use to refine the pseudo-labels of the primary video encoder. Our collaborative pseudo-labeling-based framework, FinePseudo, significantly outperforms prior methods on four fine-grained action recognition datasets: Diving48, FineGym99, FineGym288, and FineDiving, and shows improvement on existing coarse-grained datasets: Kinetics400 and Something-SomethingV2. We also demonstrate the robustness of our collaborative pseudo-labeling in handling novel unlabeled classes in open-world semi-supervised setups. Project Page: https://daveishan.github.io/finepsuedo-webpage/.
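To make the alignment distance concrete, the following is a minimal sketch of DTW and its smoothed variant, soft-DTW, between two sequences of frame embeddings. It is not the authors' implementation: the squared Euclidean frame cost, the function names, and the `gamma` parameter are assumptions made for illustration. Setting `gamma = 0` recovers the ordinary (hard) DTW cost.

```python
import math
from typing import Sequence


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two frame embeddings (assumed cost)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def soft_min(values: Sequence[float], gamma: float) -> float:
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))); hard min if gamma == 0."""
    m = min(values)
    if gamma == 0 or math.isinf(m):
        return m
    # Subtract the minimum before exponentiating for numerical stability.
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(
    u: Sequence[Sequence[float]],
    v: Sequence[Sequence[float]],
    gamma: float = 0.0,
) -> float:
    """Alignment cost between frame sequences u and v via dynamic programming."""
    n, m = len(u), len(v)
    # r[i][j] holds the best (soft) cost of aligning u[:i] with v[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sq_dist(u[i - 1], v[j - 1])
            # Allowed moves: match both frames, or advance only one sequence.
            r[i][j] = cost + soft_min(
                (r[i - 1][j - 1], r[i - 1][j], r[i][j - 1]), gamma
            )
    return r[n][m]
```

In this toy setting, a sequence and a copy of it with a repeated first frame align at zero cost, whereas the same sequence played in reverse incurs a large cost. This mirrors the phase-aware behaviour the abstract attributes to alignment distances.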

Paper Structure

This paper contains 36 sections, 9 equations, 5 figures, 15 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) Sample actions from a standard coarse-grained action recognition dataset (UCF101). (b) Sample actions from a fine-grained action recognition dataset (Diving48). (c) As a proof of concept, we consider a binary classification problem on fine-grained actions, where the model has to predict whether a pair of videos belongs to the same class or not. We use the Diving48 dataset with 10% of the training data. We first obtain frame-wise video embeddings from a pretrained frame-wise video-encoder model (details in Sec. \ref{sec:algo}). The top part of (c) shows that the cosine distance computed at each timestamp is not a discriminative measure, whereas the DTW-based alignment cost clearly separates same-class pairs from different-class pairs. The bottom part of (c) shows the performance on the binary classification task in terms of average precision, where our alignability score significantly outperforms the other standard distances.
  • Figure 2: Alignability-Verification-based Metric Learning is proposed to decide how well two video instances are alignable and to produce an 'alignability score' for effective learning from a limited labeled set $\mathbb{D}_{l}$. Our approach employs a triplet loss ($\mathcal{L}_{AT}$), treating videos from the same action class as positives and those from different classes as negatives. We selectively mine hard negatives from the sampled minibatch based on alignment distance, presenting a challenging learning task for the model $f_A$. Additionally, we incorporate a matching loss $\mathcal{L}_{score}$ to quantify the alignment between videos, serving as a verification task to determine whether a video pair belongs to the same class (i.e., alignable, target label = 1) or to different classes (i.e., non-alignable, target label = 0). Further details are provided in Sec. \ref{sec:metric}.
  • Figure 3: Collaborative Pseudo-labeling: The unlabeled instance $\mathbf{u}^{(i)}$ is processed by both video encoders ($f_E$ and $f_A$). For the Action Encoder $f_E$, the prediction $\mathbf{p}_E$ is obtained from its classification head. For the Alignability Encoder $f_A$, the embedding of $\mathbf{u}^{(i)}$ is used to compute class-wise alignability scores against a gallery of labeled embeddings $\mathbb{A}$. These scores are then used to generate a class-wise prediction $\mathbf{p}_{A}$ using the non-parametric classifier $\phi_A$. Because these predictions stem from distinct supervisory signals ($\mathbf{p}_E$ from video-level supervision and $\mathbf{p}_A$ from alignability-based supervision), they offer complementary insights, resulting in a refined collaborative pseudo-label.
  • Figure 4: Samples from the FineGym Dataset. FineGym offers a range of challenging, fine-grained action classes derived from gymnastics events. This figure showcases three action classes from the FineGym288 split. The classes differ in the phase in which a different number of turns is executed.
  • Figure 5: Clip Sampling in the Proposed GITDL Framework. From a full video $\mathbf{V}^{(i)}$, we sample two types of clips: a global clip $\mathbf{G}^{(i)}$, which is sparsely sampled (skip rate = 2), and a local clip $\mathbf{L}^{(i)}$, which is densely sampled (skip rate = 1) within the temporal range of $\mathbf{G}^{(i)}$.
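
The clip sampling described in the Figure 5 caption can be sketched as follows. Only the two skip rates (2 for the global clip, 1 for the local clip) and the constraint that the local clip lies within the temporal range of the global clip come from the caption. The clip length of 8 frames, the use of the same length for both clips, the uniform random start positions, and the function name `sample_clips` are assumptions.

```python
import random
from typing import List, Optional, Tuple


def sample_clips(
    num_video_frames: int,
    clip_len: int = 8,       # ASSUMPTION: frames per clip, not given in the caption
    global_skip: int = 2,    # sparse sampling for the global clip G
    local_skip: int = 1,     # dense sampling for the local clip L
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Return frame indices for a global clip and a local clip inside it."""
    rng = rng or random.Random()

    # Number of video frames covered by each clip, end to end.
    global_span = (clip_len - 1) * global_skip + 1
    local_span = (clip_len - 1) * local_skip + 1
    if num_video_frames < global_span:
        raise ValueError("video is too short for the requested global clip")
    if local_span > global_span:
        raise ValueError("local clip would not fit inside the global clip")

    # Global clip: random start, every `global_skip`-th frame.
    g_start = rng.randint(0, num_video_frames - global_span)
    global_idx = [g_start + k * global_skip for k in range(clip_len)]

    # Local clip: random start chosen so the clip stays within the global range.
    l_start = rng.randint(g_start, g_start + global_span - local_span)
    local_idx = [l_start + k * local_skip for k in range(clip_len)]
    return global_idx, local_idx
```

For example, `sample_clips(64, rng=random.Random(0))` yields a global clip spanning 15 video frames and a local clip of 8 consecutive frames contained in that span.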