Cross-Model Cross-Stream Learning for Self-Supervised Human Action Recognition

Mengyuan Liu, Hong Liu, Tianyu Guo

TL;DR

This article applies bootstrap your own latent (BYOL), a recent contrastive learning method, to skeleton data, and formulates the result, SkeletonBYOL, as a simple yet effective baseline for self-supervised skeleton-based action recognition.

Abstract

Owing to their instance-level discriminative ability, contrastive learning methods such as MoCo and SimCLR have been adapted from image representation learning to self-supervised skeleton-based action recognition. These methods usually use multiple data streams (i.e., joint, motion, and bone) for ensemble learning. However, how to construct a discriminative feature space within a single stream and how to effectively aggregate information from multiple streams remain open problems. To this end, this paper first applies a new contrastive learning method called BYOL to learn from skeleton data, and then formulates SkeletonBYOL as a simple yet effective baseline for self-supervised skeleton-based action recognition. Inspired by SkeletonBYOL, this paper further presents a Cross-Model and Cross-Stream (CMCS) framework, which combines Cross-Model Adversarial Learning (CMAL) and Cross-Stream Collaborative Learning (CSCL). Specifically, CMAL learns single-stream representations with a cross-model adversarial loss to obtain more discriminative features. To aggregate multi-stream information and let the streams interact, CSCL generates similarity pseudo-labels from ensemble learning as supervision and uses them to guide feature generation for individual streams. Extensive experiments on three datasets verify that CMAL and CSCL are complementary and that the proposed method achieves better results than state-of-the-art methods under various evaluation protocols.
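As a rough illustration of the BYOL objective that SkeletonBYOL builds on, the sketch below shows the two ingredients of standard BYOL: a regression loss between the online network's prediction and the target network's projection, and an exponential-moving-average (EMA) update of the target parameters. It is a minimal sketch of generic BYOL, not the paper's implementation. The function names, the flat-list parameter representation, and the value of `tau` are assumptions made for illustration, and the skeleton encoder and augmentations are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (raises on the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def byol_loss(online_prediction, target_projection):
    """Standard BYOL regression loss: 2 - 2 * cosine similarity.

    Equals 0 when the two vectors point the same way and 4 when they
    point in opposite directions. In training, the target side is
    treated as a constant (no gradient flows through it).
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_params, online_params, tau=0.99):
    """Move target parameters slowly toward the online parameters.

    tau is an illustrative decay rate (hypothetical value); parameters
    are represented as flat lists of floats for simplicity.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    # Two augmented views of the same sequence should give a small loss.
    print(byol_loss([1.0, 0.1], [0.9, 0.0]))
    print(ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9))
```

In standard BYOL the loss is symmetrized by swapping the two augmented views between the online and target branches and summing both terms; only the online network is updated by gradient descent.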

Paper Structure (12 sections, 8 equations, 4 figures, 12 tables, 1 algorithm)

Figures (4)

  • Figure 1: Illustration of our Cross-Model Cross-Stream (CMCS) framework, which mainly involves an Encoder block, a Cross-Model Adversarial Learning (CMAL) block, and a Cross-Stream Collaborative Learning (CSCL) block. (a) For a single stream, CMAL learns a better intra-stream representation by applying "Attract" and "Repel" operations to features extracted by the Encoder. For ease of viewing, the two skeleton sequences are colored green and blue; the green dot is the feature extracted from the green skeleton sequence, and the blue dot is the feature extracted from the blue skeleton sequence. (b) For multiple streams, CSCL aggregates inter-stream information and makes the feature space of each single stream consistent with the ensemble space. (Best viewed in color)
  • Figure 2: Comparison between our proposed Cross-Model Cross-Stream (CMCS) framework and our proposed simple baseline, SkeletonBYOL. The CMCS framework consists of (a) Encoder, (c) CMAL, and (d) CSCL, whereas SkeletonBYOL consists of (a) Encoder and (b) BYOL. Note that the Encoder contains Augmentation, an Online Network, and a Target Network. A red line means "Attract" and a blue line means "Repel". (Best viewed in color)
  • Figure 3: Three key components of our proposed CSCL. (Best viewed in color)
  • Figure 4: The t-SNE visualization of embeddings on NTU-60 xview. Each method extracts features for samples from 6 categories, and the figure shows the features after dimensionality reduction. Note that 3s-CMAL means applying CMAL to three streams. Compared with the previous SkeletonCLR and 3s-CrosSCLR methods, our proposed SkeletonBYOL and CMAL do not show obvious improvements, whereas our proposed 3s-CMAL and CMCS methods perform better, separating the differently colored dots more clearly from each other.