Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation

Haoyu Ji, Bowen Chen, Weihong Ren, Wenze Huang, Zhihao Yang, Zhiyong Wang, Honghai Liu

TL;DR

This work addresses skeleton-based temporal action segmentation by introducing TRG-Net, which uses text-derived priors from large language models both to model spatio-temporal relations and to supervise class distributions. The method introduces Text-Derived Joint Graphs (TJG) and Text-Derived Action Graphs (TAG), used in DSFM and ARIS respectively, and adds Spatial-Aware Enhancement Processing (SAEP) to improve generalization. Empirical results on PKU-MMD, LARa, and MCFS-130 show state-of-the-art performance with favorable efficiency, while extensive ablations validate the contribution of each component and the benefits of text-based priors for action semantics. The approach advances semantic understanding and robustness in skeleton-based action segmentation, with potential impact on robotics, surveillance, and assistive technologies.

Abstract

Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLMs) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.
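
The two augmentations named for SAEP, random joint occlusion and axial rotation, can be illustrated with a minimal sketch. Everything beyond those two ideas is an assumption made here for illustration: the `frames x joints x 3` nested-list layout, zeroing an occluded joint across all frames, the rotation angle range, and the hypothetical function names `occlude_joints` and `rotate_about_axis`. The paper's actual procedure may differ.

```python
import math
import random

def occlude_joints(sequence, num_occluded, rng):
    """Zero out `num_occluded` randomly chosen joints in every frame (assumed scheme)."""
    num_joints = len(sequence[0])
    hidden = set(rng.sample(range(num_joints), num_occluded))
    return [
        [[0.0, 0.0, 0.0] if j in hidden else list(joint)
         for j, joint in enumerate(frame)]
        for frame in sequence
    ]

def rotate_about_axis(sequence, angle_rad, axis):
    """Rotate every joint about one coordinate axis (0=x, 1=y, 2=z).

    The rotation acts in the plane spanned by the two remaining axes,
    so the coordinate along `axis` is left unchanged.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    a, b = [i for i in range(3) if i != axis]
    rotated = []
    for frame in sequence:
        new_frame = []
        for joint in frame:
            p = list(joint)
            p[a] = c * joint[a] - s * joint[b]
            p[b] = s * joint[a] + c * joint[b]
            new_frame.append(p)
        rotated.append(new_frame)
    return rotated

# Toy sequence: 2 frames, 4 joints, 3D coordinates (illustrative values only).
rng = random.Random(0)
sequence = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
    for _ in range(2)
]
angle = rng.uniform(-math.pi / 6, math.pi / 6)  # assumed angle range
augmented = rotate_about_axis(occlude_joints(sequence, 1, rng), angle, axis=1)
```

Both functions return new lists and leave the input untouched, so the same raw sequence can be augmented differently on each training pass.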

Paper Structure

This paper contains 43 sections, 17 equations, 9 figures, and 16 tables.

Figures (9)

  • Figure 1: Schematic of TRG-Net concept. The text embeddings and relational graphs generated by large language models can serve as priors for enhancing modeling and supervision of action segmentation. Specifically, the text-derived joint graph effectively captures spatial correlations, while the text-derived action graph and action embeddings supervise the relationships and distributions of action classes.
  • Figure 2: Overview of TRG-Net. TRG-Net consists of modeling, supervision, and data processing components. The Dynamic Spatio-Temporal Fusion Modeling (DSFM) employs Text-Derived Joint Graphs (TJG) to support spatial dynamic modeling and achieves spatio-temporal fusion. The Absolute-Relative Inter-Class Supervision (ARIS) utilizes Text-Derived Action Graphs (TAG) for relational supervision of action segments and action text embeddings for distribution supervision. Additionally, the Spatial-Aware Enhancement Processing (SAEP) method is introduced to further enhance generalization.
  • Figure 3: Generation of Text-Derived Relational Graphs. The descriptions of each joint and action are input into BERT to obtain the corresponding text embeddings. These embeddings are then processed using L2 distance calculations and inverse normalization to generate the TJG and TAG, which represent the semantic relationships between joints and between actions.
  • Figure 4: Dynamic Spatio-Temporal Fusion Modeling. The process begins with a multi-scale GCN for initial spatial modeling. The TJG is then combined with feature-derived channel- and frame-level dynamic joint graphs for fine-grained adaptive spatial modeling. Subsequently, adaptive weight allocation through convolution merges joint features with channel features, while linear transformers conduct multi-layer temporal modeling to establish inter-frame relationships. Spatial features with weighted adjustments are fused after each temporal layer, progressively integrating core spatio-temporal features.
  • Figure 5: Absolute-Relative Inter-Class Supervision. First, representations are segmented based on ground truth boundaries and pooled to obtain individual action features. Absolute supervision (left) compares each action feature with its corresponding text embedding via cosine similarity, optimizing class distributions through contrastive learning. Relative supervision (right) aligns computed action feature relations with the semantic relations in TAG, refining inter-class relative relationships.
  • ...and 4 more figures
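
The graph generation described in the Figure 3 caption (text embeddings, then L2 distances, then inverse normalization) can be sketched as follows. This is an illustrative reading, not the paper's formula: "inverse normalization" is assumed here to mean min-max scaling the off-diagonal distances to [0, 1] and subtracting from 1, the diagonal is assumed to be 1, and the three-dimensional toy vectors stand in for real BERT embeddings. The function name `text_derived_graph` is hypothetical.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def text_derived_graph(embeddings):
    """Turn text embeddings into a relation matrix with values in [0, 1].

    Assumed scheme: pairwise L2 distance, min-max normalization over the
    off-diagonal entries, then inversion so closer texts get larger weights.
    """
    n = len(embeddings)
    if n < 2:
        return [[1.0] for _ in range(n)]
    dist = [[l2_distance(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    d_min, d_max = min(off_diag), max(off_diag)
    span = (d_max - d_min) or 1.0  # avoid division by zero if all distances match
    return [
        [1.0 if i == j else 1.0 - (dist[i][j] - d_min) / span
         for j in range(n)]
        for i in range(n)
    ]

# Toy stand-ins for BERT embeddings of joint descriptions (illustrative values).
toy_embeddings = {
    "left hand": [1.0, 0.0, 0.0],
    "left wrist": [0.9, 0.1, 0.0],
    "right foot": [0.0, 0.0, 1.0],
}
names = list(toy_embeddings)
graph = text_derived_graph([toy_embeddings[name] for name in names])

for name, row in zip(names, graph):
    print(name, [round(weight, 3) for weight in row])
```

Under these assumptions, semantically close descriptions such as "left hand" and "left wrist" receive a higher weight than distant ones such as "left hand" and "right foot". The same routine would apply unchanged to action descriptions to obtain an action-level graph.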