STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

Xiaoyu Zhu, Po-Yao Huang, Junwei Liang, Celso M. de Melo, Alexander Hauptmann

TL;DR

This work tackles MoCap-based action recognition by directly modeling raw mesh sequences rather than relying on intermediate skeleton representations. It introduces STMT, a Spatial-Temporal Mesh Transformer that uses surface field convolution to form vertex patches and a hierarchical transformer with intra-frame offset-attention and inter-frame self-attention to capture global spatial-temporal dependencies. Two self-supervised pretraining tasks, Masked Vertex Modeling and Future Frame Prediction, reinforce global context learning and improve downstream action recognition; Joint Shuffle is additionally used for data augmentation. Empirically, STMT achieves state-of-the-art results on the KIT and BABEL benchmarks and is robust to noisy body pose estimates, highlighting the practical benefits of mesh-based action understanding for MoCap data.

Abstract

We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame offset-attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.
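
The difference between the bi-directional and auto-regressive attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses single-head attention with no learned projections, and the token count, feature size, and the assumption that tokens are ordered in time are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, causal=False):
    """Single-head scaled dot-product self-attention (no learned weights).

    tokens: array of shape (n_tokens, dim), assumed to be ordered in time.
    Returns the attended features and the attention weight matrix.
    """
    n, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)
    if causal:
        # Auto-regressive case: token i may only attend to tokens j <= i.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ tokens, weights

rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 8))  # 6 hypothetical vertex-patch tokens, 8 features each

_, w_bi = self_attention(patches, causal=False)  # bi-directional: every pair can interact
_, w_ar = self_attention(patches, causal=True)   # auto-regressive: no attention to the future
```

In the bi-directional case every token receives a non-zero weight from every other token, which suits reconstructing masked vertices from surrounding context. In the causal case the weight matrix is lower triangular, which suits predicting a future frame from past ones only.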

Paper Structure (28 sections, 11 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: Current state-of-the-art MoCap-based action recognition methods first convert body markers into a human body mesh, which is used to predict a standardized 3D skeleton. The 3D skeleton is used as input for action recognition models (dashed line). We propose a method that directly models the dynamics of raw mesh sequences (solid line). Our method saves the manual effort of deriving skeleton representations, and achieves superior recognition performance by leveraging surface motion and body shape knowledge from meshes.
  • Figure 2: Overview of the proposed framework. (a) Overview of STMT. Given a mesh sequence, we first construct vertex patches by extracting both intrinsic (geodesic) and extrinsic (Euclidean) features using surface field convolution. The intrinsic and extrinsic features are denoted by yellow and blue blocks respectively. Those patches are used as input to the intra-frame offset-attention network to learn appearance features. Then we concatenate intrinsic patches and extrinsic patches of the same position. The concatenated vertex patches (green blocks) are fed into the inter-frame self-attention network to learn spatial-temporal correlations. Finally, the local and global features are mapped into action predictions by MLP layers. (b) Overview of Pre-Training Stage. We design two pretext tasks: masked vertex modeling and future frame prediction for global context learning. Bidirectional attention is used for the reconstruction of masked vertices. Auto-regressive attention is used for the future frame prediction task.
  • Figure 3: Visualization of inter-frame attention. Red denotes the highest attention.
  • Figure 4: Effect of Different Number of Mesh Sequences.
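
The data flow described in the Figure 2(a) caption can be traced with a small shape-level sketch in NumPy. Several things here are assumptions rather than details from the paper: the offset-attention variant follows the formulation popularized by Point Cloud Transformer (input features minus attended features, passed through a layer and added back), simplified to a linear map plus ReLU with standard scaled-softmax normalization; the sizes `T`, `N`, `C` and the random weights are hypothetical; and the concatenation of intrinsic and extrinsic patches is assumed to happen along the feature axis.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def offset_attention(feats, w_out):
    """Simplified PCT-style offset-attention on the patches of one frame.

    feats: (n_patches, dim) patch features.
    w_out: (dim, dim) stand-in for the learned layer applied to the offset.
    """
    d = feats.shape[1]
    attn = softmax(feats @ feats.T / np.sqrt(d), axis=-1)
    attended = attn @ feats
    offset = feats - attended                       # difference between input and attended features
    return np.maximum(offset @ w_out, 0.0) + feats  # linear + ReLU, then residual connection

T, N, C = 4, 5, 8  # hypothetical: frames, vertex patches per frame, channels
rng = np.random.default_rng(0)
intrinsic = rng.normal(size=(T, N, C))  # stand-in for geodesic patch features
extrinsic = rng.normal(size=(T, N, C))  # stand-in for Euclidean patch features
w_int = rng.normal(size=(C, C)) * 0.1
w_ext = rng.normal(size=(C, C)) * 0.1

# Intra-frame stage: attention is applied within each frame independently.
local_int = np.stack([offset_attention(frame, w_int) for frame in intrinsic])
local_ext = np.stack([offset_attention(frame, w_ext) for frame in extrinsic])

# Patches at the same position are concatenated (assumed: along the feature axis),
# then flattened into one token sequence spanning all frames for inter-frame attention.
fused = np.concatenate([intrinsic, extrinsic], axis=-1)  # (T, N, 2C)
tokens = fused.reshape(T * N, 2 * C)                     # (T*N, 2C)
```

The key structural point is that the intra-frame stage only mixes patches within a single frame, while the flattened `tokens` sequence lets the inter-frame stage relate any patch in any frame to any other.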