BID: Boundary-Interior Decoding for Unsupervised Temporal Action Localization Pre-Training

Qihang Fang, Chengcheng Tang, Shugao Ma, Yanchao Yang

TL;DR

This work proposes the first unsupervised pre-training framework, Boundary-Interior Decoding (BID), which partitions a skeleton-based motion sequence into discovered, semantically meaningful pre-action segments, and shows results that outperform SOTA methods by a large margin.

Abstract

Skeleton-based motion representations are robust for action localization and understanding because, unlike images, they are invariant to perspective, lighting, and occlusion. Yet they are often ambiguous and incomplete when taken out of context, even for human annotators. Just as infants discern gestures before associating them with words, actions can be conceptualized before being grounded with labels. We therefore propose the first unsupervised pre-training framework, Boundary-Interior Decoding (BID), which partitions a skeleton-based motion sequence into discovered, semantically meaningful pre-action segments. By fine-tuning our pre-trained network with a small amount of annotated data, we show results that outperform SOTA methods by a large margin.

Paper Structure

This paper contains 26 sections, 9 equations, 8 figures, 7 tables, and 1 algorithm.

Figures (8)

  • Figure 1: We propose an unsupervised method that partitions a skeletal motion into semantically meaningful action segments, which we call pre-action segments and classes to differentiate them from those defined by human annotators. We study how the discovery of pre-actions can improve label efficiency in temporal action localization when fine-tuning on limited human labels.
  • Figure 2: An overview of the proposed framework, which consists of a motion encoding function $\mathcal{E}$, an interior decoder $\mathcal{U}$, and a boundary decoder $\mathcal{B}$. The motion encoding function $\mathcal{E}$ is composed of a linear encoder $E$ and a residual VQ (RVQ) module, and encodes the input motion sequence $S$ into discrete latent codes. Frames that the VQ layer assigns the same discrete latent code are considered to belong to the same action class. The sum of the discrete latent codes from the VQ layer and the RVQ layers is used as the class representation of the motion frames. In addition to the commitment loss for training the RVQ module, we devise two optimization objectives that encourage the class representation to be informative about the state transition within an action and the ending state of that action.
  • Figure 3: We compute the confusion matrix between the predicted action classes and the ground truth to identify the bottlenecks of our method. The evaluation is conducted on the union of the three subsets, which includes 12 action classes.
  • Figure 4: Visualization of pre-action classes and their corresponding motion frames.
  • Figure 5: Visualization of the discrete encoding results of the pre-trained model.
  • ...and 3 more figures
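
The encoding path described in the Figure 2 caption (a linear encoder followed by residual vector quantization, with the first-layer code acting as the per-frame class) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the codebooks and encoder weights are random rather than learned, the sizes (T, J, D, K, L) are arbitrary, the function names are hypothetical, and the interior decoder, boundary decoder, and all losses are omitted.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return, for each row of x (T, D), the index of the closest codebook row (K, D)."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def rvq_encode(z, codebooks):
    """Residual VQ: each layer quantizes what the previous layers left over.

    Returns the per-layer code indices (L, T) and the sum of the selected
    codes (T, D), which plays the role of the class representation.
    """
    residual = z.copy()
    quantized = np.zeros_like(z)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        quantized += code
        residual -= code
        indices.append(idx)
    return np.stack(indices), quantized

def segments_from_labels(labels):
    """Group consecutive frames with the same label into (start, end, label) runs."""
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    return [(int(s), int(e), int(labels[s])) for s, e in zip(starts, ends)]

# Assumed sizes: T frames, J input features per frame, D latent dims,
# K codes per codebook, L quantization layers (first VQ layer + residual layers).
rng = np.random.default_rng(0)
T, J, D, K, L = 64, 75, 16, 8, 3

S = rng.normal(size=(T, J))               # stand-in for a skeleton motion sequence
W = rng.normal(size=(J, D)) / np.sqrt(J)  # stand-in for the linear encoder E
z = S @ W

codebooks = [rng.normal(size=(K, D)) for _ in range(L)]
indices, class_repr = rvq_encode(z, codebooks)

pre_action = indices[0]                   # first-layer code = per-frame class
segments = segments_from_labels(pre_action)
```

With random codebooks the resulting segments are meaningless; the sketch only shows how per-frame codes become a partition of the sequence into contiguous segments, with `end` exclusive in each `(start, end, label)` tuple.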