Learning Discriminative Spatio-temporal Representations for Semi-supervised Action Recognition

Yu Wang, Sanping Zhou, Kun Xia, Le Wang

TL;DR

The paper tackles semi-supervised action recognition under scarce labeled data, where actions with similar spatio-temporal cues are easily confused. It introduces a unified framework that combines Adaptive Contrastive Learning (ACL) for discriminative spatial representations with Multi-scale Temporal Learning (MTL) to capture long-term temporal structure, both built on a teacher-student EMA backbone. ACL builds class prototypes from labeled data, maintains a momentum memory bank of pseudo-labeled samples, and uses a Gaussian Mixture Model to derive reliability scores that adaptively select positive and negative samples for contrastive learning. MTL samples clips at multiple temporal scales, applies cross-scale temporal calibration to emphasize informative long-term semantics, and aligns them with short-term representations. Together, ACL and MTL achieve state-of-the-art results on UCF101, HMDB51, and Kinetics-400 across various labeling rates, without relying on extra modalities, and demonstrate strong gains on actions with ambiguous spatio-temporal cues.
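The teacher-student EMA backbone mentioned above can be illustrated with a minimal sketch. It assumes model parameters are stored as a plain dictionary of floats; the function name `ema_update` and the `momentum` value are hypothetical choices for illustration and are not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters moved toward the student by an exponential moving average.

    teacher, student: dicts mapping parameter names to float values (an assumption
    made for this sketch; real models hold tensors).
    momentum: fraction of the old teacher value that is kept (hypothetical default).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Usage: the teacher drifts slowly toward the student after each training step.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, momentum=0.9)
```

A high momentum keeps the teacher stable, which is the usual reason for generating pseudo-labels from the teacher rather than from the rapidly changing student.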

Abstract

Semi-supervised action recognition aims to improve spatio-temporal reasoning with a small amount of labeled data in conjunction with a large amount of unlabeled data. Despite recent advances, existing methods are still prone to ambiguous predictions when labeled data is scarce, and they struggle to distinguish different actions that share similar spatio-temporal information. In this paper, we approach this problem by equipping the model with two capabilities, discriminative spatial modeling and temporal structure modeling, for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples using the class prototypes of the labeled data, and adaptively selects positive and negative samples from a pseudo-labeled sample bank to construct contrastive learning. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It highlights informative semantics from long-term clips and integrates them into the short-term clip while suppressing noisy information. Both techniques are then integrated into a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51 and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.
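The prototype-based confidence assessment can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: it assumes features are plain lists of floats, that confidence is measured by cosine similarity to the nearest class prototype, and that a two-component one-dimensional Gaussian mixture is fitted to those similarities so that the posterior of the higher-mean component serves as a reliability score. All function names are hypothetical.

```python
import math


def class_prototypes(features, labels):
    """Average the labeled feature vectors of each class."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def _gauss(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def reliability_scores(scores, iters=50):
    """Fit a two-component 1D Gaussian mixture with EM and return, for each score,
    the posterior probability of the higher-mean ("reliable") component."""
    mean_all = sum(scores) / len(scores)
    var_all = max(sum((s - mean_all) ** 2 for s in scores) / len(scores), 1e-6)
    means, variances, weights = [min(scores), max(scores)], [var_all, var_all], [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            p = [weights[k] * _gauss(s, means[k], variances[k]) for k in range(2)]
            total = sum(p) + 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            variances[k] = max(
                sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk, 1e-6
            )
    high = 0 if means[0] > means[1] else 1
    return [r[high] for r in resp]


# Usage: similarity of each unlabeled feature to its nearest prototype, then reliability.
protos = class_prototypes([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], ["a", "a", "b"])
unlabeled = [[1.0, 0.05], [0.6, 0.6], [0.02, 1.0], [0.7, 0.6]]
sims = [max(cosine(u, p) for p in protos.values()) for u in unlabeled]
rel = reliability_scores(sims)
```

Under these assumptions, samples with a high reliability score would be candidates for positive and negative selection from the pseudo-labeled bank, while low-reliability samples would be down-weighted or skipped.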

Paper Structure (16 sections, 20 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Existing methods are prone to making ambiguous predictions for actions with similar spatio-temporal semantics. In Figure 1(a), the model misrecognizes the action "Nunchucks" as "Tai Chi" because of their similar spatial information. In Figure 1(b), it is also difficult for the model to distinguish between "High Jump" and "Long Jump", which have similar sub-actions and temporal structures.
  • Figure 2: An overview of the proposed Learning Discriminative Spatio-temporal Representations framework. It consists of three parts: (1) a basic framework, including a teacher model for providing pseudo-labels and a student model for online learning, (2) Adaptive Contrastive Learning (ACL), and (3) Multi-scale Temporal Learning (MTL). The labeled portion of the input consists of short-term clips from labeled samples, while the unlabeled portion consists of short-term clips and long-term clips at different scales from unlabeled samples.
  • Figure 3: Illustration of the adaptive contrastive learning module. We determine the confidence of unlabeled samples and select positive and negative samples for them based on class prototypes.
  • Figure 4: Illustration of the multi-scale temporal learning module. We calibrate long-term clips of different scales and align them with the short-term clip.
  • Figure 5: t-SNE of features on the UCF-101 dataset with the 1% labeled setting. The top row shows base features for 5 and 10 categories respectively, whereas the bottom row shows our features after learning discriminative spatio-temporal representations. Dots of different colors represent different classes.
  • ...and 1 more figures
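
The short-term and long-term clips at different scales described in the captions for Figures 2 and 4 could be sampled along the following lines. This is only a sketch: it assumes that a "scale" corresponds to the frame stride used when sampling a fixed number of frames, which the captions do not state, and the names `sample_clip` and `multi_scale_clips` are hypothetical.

```python
def sample_clip(num_frames, clip_len, stride, start=0):
    """Return clip_len frame indices spaced by stride, clamped to the last frame."""
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]


def multi_scale_clips(num_frames, clip_len=8, strides=(1, 2, 4)):
    """Sample one clip per stride; the smallest stride plays the role of the
    short-term clip and larger strides cover longer temporal spans."""
    return {stride: sample_clip(num_frames, clip_len, stride) for stride in strides}


# Usage: one short-term clip (stride 1) and one long-term clip (stride 4).
clips = multi_scale_clips(num_frames=64, clip_len=4, strides=(1, 4))
# clips == {1: [0, 1, 2, 3], 4: [0, 4, 8, 12]}
```

All clips have the same number of frames here, so a single backbone could process every scale; only the temporal span covered by each clip changes.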