Towards Micro-Action Recognition with Limited Annotations: An Asynchronous Pseudo Labeling and Training Approach

Yan Zhang, Lechao Cheng, Yaxiong Wang, Zhun Zhong, Meng Wang

TL;DR

This work addresses the annotation bottleneck in Micro-Action Recognition by proposing Semi-Supervised MAR (SSMAR) and an asynchronous learning framework, APLT, that decouples pseudo-label generation from model training. Phase I generates high-quality pseudo-labels via semi-supervised clustering with labeled augmentation and self-adaptive thresholds, feeding a memory-based prototype classifier. Phase II trains the model with a combined loss that leverages both the parametric classifier and the fixed prototypes, with alternating offline and online updates to reduce overfitting. Across three MAR benchmarks, APLT consistently outperforms state-of-the-art SSL methods, achieving substantial gains such as a 14.5 percentage point improvement over FixMatch on MA-12 with 50% labeled data, demonstrating the practical impact of asynchronous pseudo-labeling and non-parametric supervision in low-label regimes.

Abstract

Micro-Action Recognition (MAR) aims to classify subtle human actions in video. However, annotating MAR datasets is particularly challenging due to the subtlety of the actions. To this end, we introduce the setting of Semi-Supervised MAR (SSMAR), where only a portion of the samples are labeled. We first evaluate traditional Semi-Supervised Learning (SSL) methods on SSMAR and find that these methods tend to overfit to inaccurate pseudo-labels, leading to error accumulation and degraded performance. This issue primarily arises from the common practice of directly using the classifier's predictions as pseudo-labels to train the model. To solve this issue, we propose a novel framework, called Asynchronous Pseudo Labeling and Training (APLT), which explicitly separates the pseudo-labeling process from model training. Specifically, we introduce a semi-supervised clustering method during the offline pseudo-labeling phase to generate more accurate pseudo-labels. Moreover, a self-adaptive thresholding strategy is proposed to dynamically filter noisy labels of different classes. We then build a memory-based prototype classifier from the filtered pseudo-labels, which is fixed and used to guide the subsequent model training phase. By alternating the pseudo-labeling and model training phases in an asynchronous manner, the model can not only be trained with more accurate pseudo-labels but also avoid the overfitting issue. Experiments on three MAR datasets show that our APLT largely outperforms state-of-the-art SSL methods. For instance, APLT improves accuracy by 14.5% over FixMatch on the MA-12 dataset when using only 50% labeled data. Code will be publicly available.
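
The offline phase described in the abstract can be illustrated with a minimal sketch, under stated assumptions. The abstract does not give the thresholding rule, so `class_thresholds` below uses a hypothetical one (the mean confidence within each class). All function names are illustrative and are not taken from the authors' code. Prototypes are formed by averaging the retained features of each class, and prediction uses cosine similarity to the fixed prototypes.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def class_thresholds(labels, confidences):
    """Hypothetical rule (an assumption): threshold = mean confidence per class."""
    sums, counts = defaultdict(float), defaultdict(int)
    for y, c in zip(labels, confidences):
        sums[y] += c
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def build_prototypes(features, labels, confidences):
    """Average the normalized features that pass their own class threshold."""
    thresholds = class_thresholds(labels, confidences)
    kept = defaultdict(list)
    for f, y, c in zip(features, labels, confidences):
        if c >= thresholds[y]:
            kept[y].append(l2_normalize(f))
    prototypes = {}
    for y, feats in kept.items():
        mean = [sum(col) / len(feats) for col in zip(*feats)]
        prototypes[y] = l2_normalize(mean)
    return prototypes


def predict(prototypes, feature):
    """Return the class whose fixed prototype is most cosine-similar."""
    f = l2_normalize(feature)
    return max(
        prototypes,
        key=lambda y: sum(a * b for a, b in zip(prototypes[y], f)),
    )


# Toy pseudo-labeled features: the low-confidence sample is filtered out.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
labels = [0, 0, 1, 1, 0]
confidences = [0.9, 0.8, 0.95, 0.85, 0.3]
prototypes = build_prototypes(features, labels, confidences)
```

Because the prototypes are computed once per offline phase and then held fixed, noisy predictions made during online training cannot feed back into them until the next offline phase.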

Paper Structure

This paper contains 13 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: We present a new setting, Semi-Supervised Micro-Action Recognition (SSMAR), which aims to train a model that can recognize subtle, rapid micro-actions in videos by utilizing both labeled and unlabeled data.
  • Figure 2: (a) Micro-actions (MA-12) are less distinct from each other and more difficult to differentiate than conventional actions (UCF101 and HMDB51). (b) Comparison of the training process between our method and FixMatch on traditional action recognition datasets and a MAR dataset with 50% labeled data. FixMatch performs well on the traditional action recognition datasets (HMDB51 and UCF101). However, when it is applied to a MAR dataset (MA-12), the number of unlabeled samples that pass the fixed threshold of the online pseudo-labeling method gradually increases as training proceeds, while the accuracy of the pseudo-labels gradually decreases. In contrast, our method consistently generates highly accurate pseudo-labels. (c) FixMatch performs synchronous pseudo-labeling and training. Instead, the proposed asynchronous approach separates pseudo-labeling from the training process: the pseudo-labels are first obtained by semi-supervised clustering in the offline phase and are then utilized for online model training.
  • Figure 3: Overview of the proposed APLT framework. APLT includes two phases: offline pseudo-labeling and online model training. During the offline phase, we generate reliable pseudo-labels by semi-supervised clustering and self-adaptive thresholding. In addition, we construct a memory-based prototype classifier by averaging the features assigned to the same cluster. During the online phase, we augment both labeled and unlabeled samples. For the labeled data, we use the ground-truth labels to supervise the two classifiers ($\mathcal{L}^{margin}_{sup}$ and $\mathcal{L}^{logits}_{sup}$). For the unlabeled data, we use the predictions of the traditional classifier to supervise that same classifier ($\mathcal{L}^{logits}_{u}$), while using the pseudo-labels generated in the offline phase to supervise the prototype classifier ($\mathcal{L}^{margin}_{u}$). "WA" and "SA" stand for weak augmentation and strong augmentation, respectively.
  • Figure 4: Left: Per-class accuracy comparison between APLT and FixMatch on MA-12 with 10% and 50% labeled data. Right: Visualization of the predictions of APLT and FixMatch. Both methods are trained with ResNet-18.
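
For reference, the synchronous pseudo-labeling that Figure 2 contrasts with APLT can be sketched as follows. This is a minimal, framework-free illustration of the standard FixMatch unlabeled loss: the prediction on the weakly augmented view becomes a hard label if its confidence reaches a fixed threshold, and that label supervises the strongly augmented view. The function names and the toy logits are illustrative. Whether APLT's $\mathcal{L}^{logits}_{u}$ term takes exactly this form is an assumption based on the "WA"/"SA" description in Figure 3.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixmatch_unlabeled_loss(weak_logits, strong_logits, tau=0.95):
    """Masked cross-entropy: weak-view hard labels supervise strong views."""
    total, kept = 0.0, 0
    for w, s in zip(weak_logits, strong_logits):
        probs = softmax(w)
        conf = max(probs)
        if conf < tau:
            continue  # rejected by the fixed confidence threshold
        label = probs.index(conf)
        total += -math.log(softmax(s)[label])
        kept += 1
    # Average over the whole batch, so rejected samples contribute zero loss.
    return total / max(len(weak_logits), 1), kept


# Toy batch: the first sample is confident, the second is not.
weak = [[10.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
strong = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
loss, kept = fixmatch_unlabeled_loss(weak, strong)
```

Here the classifier that produces the pseudo-labels is the same one being trained on them, which is the coupling Figure 2(b) links to falling pseudo-label accuracy on MA-12.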