Language-Assisted Human Part Motion Learning for Skeleton-Based Temporal Action Segmentation

Bowen Chen; Haoyu Ji; Zhiyong Wang; Benjamin Filtjens; Chunzhuo Wang; Weihong Ren; Bart Vanrumste; Honghai Liu

TL;DR

This work proposes a novel method named Language-assisted Human Part Motion Representation Learning (LPL), which contains a Disentangled Part Motion Encoder (DPE) to extract dual-level motion representations and a Language-assisted Distribution Alignment (LDA) strategy for optimizing spatial relations within representations.

Abstract

Skeleton-based Temporal Action Segmentation involves the dense action classification of variable-length skeleton sequences. Current approaches primarily apply graph-based networks to extract framewise, whole-body-level motion representations, and use one-hot encoded labels for model optimization. However, whole-body motion representations do not capture fine-grained part-level motion, and one-hot encoded labels neglect the intrinsic semantic relationships within the language-based action definitions. To address these limitations, we propose a novel method named Language-assisted Human Part Motion Representation Learning (LPL), which contains a Disentangled Part Motion Encoder (DPE) to extract dual-level (i.e., part and whole-body) motion representations and a Language-assisted Distribution Alignment (LDA) strategy for optimizing spatial relations within representations. Specifically, after part-aware skeleton encoding via DPE, LDA generates dual-level action descriptions to construct a textual embedding space with the help of a large-scale language model. Then, LDA encourages the alignment of the embedding spaces of text descriptions and motions. This alignment allows LDA not only to enhance intra-class compactness but also to transfer the language-encoded semantic correlations among actions to skeleton-based motion learning. Moreover, we propose a simple yet efficient Semantic Offset Adapter to smooth the cross-domain misalignment. Our experiments indicate that LPL achieves state-of-the-art performance across various datasets (e.g., +4.4% Accuracy, +5.6% F1 on the PKU-MMD dataset). Moreover, LDA is compatible with existing methods and improves their performance (e.g., +4.8% Accuracy, +4.3% F1 on the LARa dataset) without additional inference costs.
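The alignment between motion and text embedding spaces described in the abstract can be illustrated with a generic contrastive objective. The sketch below is not the paper's $\mathcal{L}_{aln}$, whose exact form is not given on this page; it assumes an InfoNCE-style cross-entropy over cosine similarities between each motion feature and one text embedding per action class. The function names, the temperature value, and the toy 2-D vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_loss(motion_feats, labels, text_embeds, temperature=0.1):
    """Assumed InfoNCE-style loss: mean cross-entropy over cosine-similarity
    logits between each motion feature and every class text embedding."""
    if not motion_feats:
        raise ValueError("motion_feats must not be empty")
    text_unit = [l2_normalize(t) for t in text_embeds]
    total = 0.0
    for feat, label in zip(motion_feats, labels):
        feat_unit = l2_normalize(feat)
        # Cosine similarity to each class description, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(feat_unit, t)) / temperature
            for t in text_unit
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denom - logits[label]
    return total / len(motion_feats)


# Toy data (hypothetical): two action classes with 2-D text embeddings.
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # motion features near their class text
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # motion features near the wrong text

loss_aligned = alignment_loss(aligned, labels, text_embeds)
loss_misaligned = alignment_loss(misaligned, labels, text_embeds)
```

In this toy setup the loss is small when each motion feature points toward its own class description and large when it points toward another class, which is the behavior an alignment term of this kind relies on. Because such a term only shapes training, it is consistent with the abstract's statement that LDA adds no inference cost.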

Paper Structure (33 sections, 15 equations, 6 figures, 9 tables)

This paper contains 33 sections, 15 equations, 6 figures, 9 tables.

Introduction
Related Work
Video-based Temporal Action Segmentation
Skeleton-based Temporal Action Segmentation
GCN-based Skeleton-based Action Recognition
Language Model Application in Action Understanding
Method
Preliminaries
General STAS Pipeline
Pooling-based Part Modeling
Disentangled Part Feature Learning
Part Feature Banks
Global&Part-Level Spatio-Temporal Modeling
Part-Global Interaction
Language-assisted Distribution Alignment
...and 18 more sections

Figures (6)

Figure 1: Comparison between previous STAS methods and ours. Previous methods extract global (whole-body) features and optimize the motion representations only with one-hot encoded labels, losing part-level motion understanding and the semantic correlations across actions. Our method enhances action understanding at the part level and integrates semantic correlations as auxiliary knowledge into motion representation learning, achieving a deeper comprehension aligned with human cognition.
Figure 2: Framework of the proposed Language-Assisted Human Part Feature Learning (LPL). The skeleton sequence is encoded by the disentangled part feature learning module (Sec. III.A). In addition to general framewise classification, we incorporate language-assisted distribution learning to generate a textual feature distribution from the view of human knowledge. The contrastive loss $\mathcal{L}_{aln}$ pulls the skeleton feature distribution toward the textual feature distribution to achieve human knowledge injection.
Figure 3: Structural comparison between our DPE and previous motion encoders. a) Traditional C-ST modeling and the corresponding encoder, where part features are obtained by pooling joints over global features. b) State-of-the-art DeST modeling method, where part modeling is absent. c) Our DPE, which extracts part and body features in parallel and keeps interactions between part and whole-body features. This design combines the part-awareness of C-ST with the discriminative features of DeST.
Figure 4: Comparison of the representation-space structure between textual descriptions and the motions extracted from DPE (hip part, trained on part of the PKU-MMD dataset). The ★ in the left figure denotes the textual representation extracted by CLIP (ViT-B/32). The ★ and $\bullet$ in the right figure denote the class-wise pooled representations and instance-wise segment representations, respectively. The visualization shows that the motion representation space optimized with one-hot labels suffers from 1) a severe loss of semantic correlations and 2) unclear inter-class decision boundaries caused by low intra-class compactness.
Figure 5: Visualization of predictions from our LPL, DeST-Transformer, and MS-GCN. The first row shows the ground-truth labels, and the bottom three rows show the corresponding predictions. Failure cases are enclosed in dashed boxes.
...and 1 more figures
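The pooling-based part modeling mentioned in Figure 3(a), where part features are obtained by pooling joints over global features, can be sketched as a per-part mean over joint features. This is a minimal illustration under stated assumptions: the five-joint skeleton, the part-to-joint grouping, and the function name are hypothetical, and mean pooling is one common choice rather than a confirmed detail of any encoder compared in the paper.

```python
def pool_parts(joint_feats, part_joints):
    """Average the feature vectors of the joints belonging to each body part.

    joint_feats: list of per-joint feature vectors for a single frame.
    part_joints: dict mapping a part name to the indices of its joints.
    Returns a dict mapping each part name to its mean feature vector.
    """
    dim = len(joint_feats[0])
    pooled = {}
    for part, indices in part_joints.items():
        pooled[part] = [
            sum(joint_feats[j][c] for j in indices) / len(indices)
            for c in range(dim)
        ]
    return pooled


# Hypothetical 5-joint skeleton with 2-channel features for one frame.
joint_feats = [
    [1.0, 0.0],  # joint 0
    [3.0, 2.0],  # joint 1
    [0.0, 4.0],  # joint 2
    [2.0, 2.0],  # joint 3
    [4.0, 0.0],  # joint 4
]
# Hypothetical grouping of joints into body parts.
part_joints = {"torso": [0, 1], "arm": [2, 3, 4]}

parts = pool_parts(joint_feats, part_joints)
```

Pooling of this kind derives part features from features that were already mixed at the whole-body level, which is the limitation Figure 3(c) contrasts with DPE's parallel extraction of part and body features.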
