HumMUSS: Human Motion Understanding using State Space Models

Arnab Kumar Mondal, Stefano Alletto, Denis Tome

TL;DR

HumMUSS introduces an attention-free spatiotemporal backbone based on diagonal state space models (SSMs) for human motion understanding, addressing the real-time inference and frame-rate generalization limitations of Transformer models. The architecture uses gated diagonal SSM blocks, in bidirectional and unidirectional variants, to build a two-stream spatiotemporal layer that fuses spatial-then-temporal and temporal-then-spatial information with learnable weights. Pretrained on 3D motion and 2D lifting tasks, HumMUSS achieves competitive results in 3D pose estimation, mesh recovery, and skeleton-based action recognition. It also trains faster, uses less memory, and generalizes robustly to unseen frame rates, and it includes a fully causal variant for real-time use. The work demonstrates that a continuous-time, attention-free SSM approach can match Transformer performance on multiple motion tasks while delivering practical benefits for on-device, real-time applications.

Abstract

Understanding human motion from video is essential for a range of applications, including pose estimation, mesh recovery and action recognition. While state-of-the-art methods predominantly rely on transformer-based architectures, these approaches have limitations in practical scenarios. Transformers are slower when sequentially predicting on a continuous stream of frames in real time, and do not generalize to new frame rates. In light of these constraints, we propose a novel attention-free spatiotemporal model for human motion understanding, building upon recent advancements in state space models. Our model not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits such as adaptability to different video frame rates and enhanced training speed when working with longer sequences of keypoints. Moreover, the proposed model supports both offline and real-time applications. For real-time sequential prediction, our model is both memory-efficient and several times faster than transformer-based approaches while maintaining their high accuracy.
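
The two properties claimed above, constant per-frame cost and adaptability to frame rate, can be illustrated with a minimal sketch of a diagonal continuous-time SSM. This is not the HumMUSS block: it has no gating, bidirectionality, or learned parameters. It assumes a zero-order-hold discretization, and the names discretize, step, and run, along with all parameter values, are hypothetical choices made for illustration.

```python
import numpy as np

def discretize(lam, b, dt):
    """Zero-order-hold discretization of x'(t) = lam * x(t) + b * u(t).

    lam holds the (nonzero) diagonal entries of the state matrix, so every
    operation is elementwise. dt is the time between frames.
    """
    a_bar = np.exp(lam * dt)
    b_bar = (a_bar - 1.0) / lam * b
    return a_bar, b_bar

def step(state, u, a_bar, b_bar, c):
    """One recurrent update: needs only the current input and current state."""
    state = a_bar * state + b_bar * u
    y = (c * state).sum().real
    return state, y

def run(us, lam, b, c, dt):
    """Process a stream of scalar inputs one frame at a time."""
    a_bar, b_bar = discretize(lam, b, dt)
    state = np.zeros_like(lam)  # fixed-size state, independent of stream length
    ys = []
    for u in us:
        state, y = step(state, u, a_bar, b_bar, c)
        ys.append(y)
    return np.array(ys)

# Hypothetical parameters: stable complex modes (negative real parts).
lam = np.array([-0.5 + 1.0j, -0.5 - 1.0j, -2.0 + 0.0j])
b = np.ones(3, dtype=complex)
c = np.array([0.3 + 0.1j, 0.3 - 0.1j, 0.5 + 0.0j])

# The same constant signal sampled at two frame rates over 0.8 time units.
ys_fast = run(np.ones(8), lam, b, c, dt=0.1)
ys_slow = run(np.ones(4), lam, b, c, dt=0.2)

# Outputs agree at the shared time stamps without changing lam, b, or c.
print(np.allclose(ys_fast[1::2], ys_slow))
```

Only dt changes between the two runs, so the same continuous-time parameters serve both frame rates. The agreement is exact here because zero-order hold is exact for a piecewise-constant input; for a general signal it would only be approximate.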

Paper Structure

This paper contains 29 sections, 14 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: HumMUSS vs. Transformer-Based Models for sequential prediction of 3D poses and human meshes from 2D keypoint videos. Top: Transformer-based models attend to a history of 2D poses/keypoints to predict the current frame's output. Bottom: HumMUSS, being a stateful model, efficiently utilizes only the current frame and its current state for predictions, ensuring constant memory and time complexity. HumMUSS also generalizes to new frame rates and enhances the training speed without compromising the prediction accuracy.
  • Figure 2: HumMUSS model architecture
  • Figure 3: Bidirectional Gated DSSM Block
  • Figure 4: Unidirectional Gated DSSM Block
  • Figure 5: Comparison of 3D pose estimation performance (MPJPE in mm) between HumMUSS and MotionBERT [zhu2023motionbert] on MPI-INF-3DHP at different sub-sampling rates. Left: causal models; Right: bi-directional models.
  • ...and 4 more figures