SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations
Shuting Zhao, Linxin Bai, Liangjing Shao, Ye Zhang, Xinrong Chen
TL;DR
The paper tackles real-time full-body pose estimation from sparse head-mounted display (HMD) signals in AR/VR. It introduces SSD-Poser, a lightweight architecture based on State Space Duality that combines a State Space Attention Encoder (SSAE) with a Frequency-Aware Decoder (FAD) to balance accuracy and speed. Key contributions include a three-stage pipeline (base feature extraction, spatiotemporal encoding, and pose refinement), a hybrid SSAE that fuses state-space dynamics with Transformer-style attention, and a frequency-aware module that reduces jitter while preserving detail. Extensive experiments on AMASS demonstrate state-of-the-art accuracy and fast, consistent inference. The approach enables realistic, responsive avatar reconstruction from limited HMD data, supporting immersive AR/VR interaction with lower latency and smoother motion.
Abstract
The growing number of AR/VR applications increases the demand for real-time full-body pose estimation from Head-Mounted Displays (HMDs). Although HMDs provide joint signals from the head and hands, reconstructing a full-body pose remains challenging because the lower body is unconstrained. Recent methods often rely on conventional neural networks and generative models, such as Transformers and diffusion models, to improve performance on this task. However, these approaches struggle to balance precise pose reconstruction against fast inference. To overcome this, a lightweight and efficient model, SSD-Poser, is designed for robust full-body motion estimation from sparse observations. SSD-Poser incorporates a hybrid encoder, the State Space Attention Encoder (SSAE), which adapts state space duality to complex motion and enables real-time, realistic pose reconstruction. Moreover, a Frequency-Aware Decoder (FAD) is introduced to mitigate the jitter caused by variable-frequency motion signals, substantially improving motion smoothness. Comprehensive experiments on the AMASS dataset demonstrate that SSD-Poser achieves high accuracy at low computational cost, and that its inference efficiency compares favorably with state-of-the-art methods.
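
For readers unfamiliar with the term, the sketch below illustrates the generic idea behind state space duality: a linear state-space recurrence with a scalar decay can be evaluated either step by step or as a causally masked, attention-like weighted sum, and both give the same output. It is not the SSD-Poser implementation. The function names, the single input channel, and the toy values are assumptions made for illustration only.

```python
# Illustrative sketch only: a single-channel, scalar-decay state space model.
# All names here are hypothetical and do not come from the paper.

def ssm_recurrent(a, B, C, x):
    """Recurrent view: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        # Decay the previous state, then inject the current input.
        h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, B_t)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(C_t, h)))
    return ys


def ssm_attention_like(a, B, C, x):
    """Dual view: y_t = sum over s <= t of (a_{s+1} * ... * a_t) * (C_t . B_s) * x_s."""
    ys = []
    for t in range(len(x)):
        y_t = 0.0
        decay = 1.0  # empty product when s == t
        for s in range(t, -1, -1):
            score = sum(c * b for c, b in zip(C[t], B[s]))  # plays the role of q . k
            y_t += decay * score * x[s]
            decay *= a[s]  # extend the product to cover a_s for the next, earlier s
        ys.append(y_t)
    return ys


# Toy sequence: 4 time steps, state dimension 2.
a = [0.9, 0.8, 0.7, 0.6]
B = [[1.0, 0.5], [0.2, -0.3], [0.0, 1.0], [0.4, 0.4]]
C = [[0.5, 1.0], [1.0, 0.0], [-0.5, 0.5], [0.3, 0.7]]
x = [1.0, -2.0, 0.5, 3.0]

y_rec = ssm_recurrent(a, B, C, x)
y_att = ssm_attention_like(a, B, C, x)
```

The recurrent form costs time linear in the sequence length and suits streaming inference, while the masked-sum form exposes the same computation in a shape that resembles attention. That equivalence is what makes it natural to combine state-space layers with attention in a single encoder.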
