HumMUSS: Human Motion Understanding using State Space Models
Arnab Kumar Mondal, Stefano Alletto, Denis Tome
TL;DR
HumMUSS introduces an attention-free spatiotemporal backbone based on diagonal state space models (SSMs) for human motion understanding, addressing two limitations of Transformer models: slow real-time inference and poor generalization across frame rates. The architecture builds a two-stream spatiotemporal layer from gated diagonal SSM blocks, which come in bidirectional and unidirectional variants; the layer fuses spatial-then-temporal and temporal-then-spatial information with learnable weights. Pretrained on 3D motion and 2D lifting tasks, HumMUSS achieves competitive results in 3D pose estimation, mesh recovery, and skeleton-based action recognition, while offering significant gains in training speed and memory efficiency and generalizing robustly to unseen frame rates. A fully causal variant supports real-time use. The work demonstrates that a continuous-time, attention-free SSM approach can match Transformer performance on multiple motion tasks while delivering practical benefits for on-device, real-time applications.
Abstract
Understanding human motion from video is essential for a range of applications, including pose estimation, mesh recovery, and action recognition. While state-of-the-art methods predominantly rely on transformer-based architectures, these approaches have limitations in practical scenarios. Transformers are slow when predicting sequentially on a continuous stream of frames in real time, and they do not generalize to new frame rates. In light of these constraints, we propose a novel attention-free spatiotemporal model for human motion understanding, building upon recent advancements in state space models. Our model not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits, such as adaptability to different video frame rates and faster training on longer sequences of keypoints. Moreover, the proposed model supports both offline and real-time applications. For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches while maintaining their high accuracy.
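The frame-rate adaptability and cheap sequential prediction described above both stem from treating the model as a continuous-time system that is discretized with an explicit time step. The following is a minimal sketch of that idea for a single diagonal SSM, not the authors' implementation: it assumes zero-order-hold discretization, a scalar input per frame, and hand-picked parameters (`LAMS`, `BS`, `CS`, `discretize`, and `run_ssm` are all hypothetical names introduced for illustration).

```python
import cmath

def discretize(lam, b, dt):
    """Zero-order-hold discretization of one diagonal mode x' = lam*x + b*u.

    Assumes lam != 0. The time step dt is the only place the frame rate enters.
    """
    a_bar = cmath.exp(lam * dt)
    b_bar = (a_bar - 1.0) / lam * b
    return a_bar, b_bar

def run_ssm(lams, bs, cs, inputs, dt):
    """Causal recurrence over a stream of frames; one real output per frame."""
    params = [discretize(lam, b, dt) for lam, b in zip(lams, bs)]
    state = [0j] * len(lams)  # fixed-size state, independent of history length
    outputs = []
    for u in inputs:
        # Each mode is updated independently because the state matrix is diagonal.
        state = [a * x + b * u for (a, b), x in zip(params, state)]
        outputs.append(sum(c * x for c, x in zip(cs, state)).real)
    return outputs

# Hypothetical parameters: two stable modes (negative real part).
LAMS = [-0.5 + 2.0j, -1.0 - 0.5j]
BS = [1.0 + 0j, 0.5 + 0j]
CS = [1.0 + 0j, 1.0 - 1.0j]

# The same one-second constant signal, sampled at 10 and at 20 frames per second.
y_10fps = run_ssm(LAMS, BS, CS, [1.0] * 10, dt=1 / 10)
y_20fps = run_ssm(LAMS, BS, CS, [1.0] * 20, dt=1 / 20)
```

Because only `dt` changes between the two runs, the same parameters yield matching outputs at matching points in time (up to floating-point error) for this piecewise-constant input. The loop also keeps a fixed-size state, so the cost of processing each new frame does not grow with the number of frames already seen, in contrast to attention over the full history.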
