SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations

Shuting Zhao, Linxin Bai, Liangjing Shao, Ye Zhang, Xinrong Chen

TL;DR

The paper tackles real-time full-body pose estimation from sparse HMD signals in AR/VR. It introduces SSD-Poser, a lightweight architecture based on State Space Duality that combines a State Space Attention Encoder (SSAE) with a Frequency-Aware Decoder (FAD) to balance accuracy and speed. Key contributions include a three-stage pipeline (base feature extraction, spatiotemporal encoding, and pose refinement), a hybrid SSAE that fuses state-space dynamics with Transformer-style attention, and a frequency-aware module that reduces jitter while preserving detail; extensive AMASS experiments demonstrate state-of-the-art accuracy and fast, consistent inference. The approach enables realistic, responsive avatar reconstruction from limited HMD data, supporting immersive AR/VR interactions with lower latency and improved motion smoothness.

Abstract

The growing applications of AR/VR increase the demand for real-time full-body pose estimation from Head-Mounted Displays (HMDs). Although HMDs provide joint signals from the head and hands, reconstructing a full-body pose remains challenging due to the unconstrained lower body. Recent advancements often rely on conventional neural networks and generative models to improve performance in this task, such as Transformers and diffusion models. However, these approaches struggle to strike a balance between achieving precise pose reconstruction and maintaining fast inference speed. To overcome these challenges, a lightweight and efficient model, SSD-Poser, is designed for robust full-body motion estimation from sparse observations. SSD-Poser incorporates a well-designed hybrid encoder, State Space Attention Encoders, to adapt the state space duality to complex motion poses and enable real-time realistic pose reconstruction. Moreover, a Frequency-Aware Decoder is introduced to mitigate jitter caused by variable-frequency motion signals, remarkably enhancing the motion smoothness. Comprehensive experiments on the AMASS dataset demonstrate that SSD-Poser achieves exceptional accuracy and computational efficiency, showing outstanding inference efficiency compared to state-of-the-art methods.
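
The abstract names state space duality without spelling it out. As a rough illustration only, the sketch below shows the general idea behind the term: a linear state-space recurrence with a scalar decay per step can be evaluated either as a sequential scan or as a masked, attention-like matrix product, and both give the same output. This is not the paper's SSAE or PSSB; the function names, shapes, and the single-channel input are assumptions made for the example.

```python
import numpy as np

def ssm_recurrent(a, B, C, x):
    """Sequential form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.zeros(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_attention_like(a, B, C, x):
    """Matrix form: y = (L * (C B^T)) x, with a causal decay mask L."""
    # cum[i] - cum[j] = log(a[j+1] * ... * a[i]); requires a > 0.
    cum = np.cumsum(np.log(a))
    # L[i, j] = a[j+1] * ... * a[i] for i >= j, and 0 above the diagonal.
    L = np.tril(np.exp(cum[:, None] - cum[None, :]))
    # (C @ B.T)[i, j] = C_i . B_j plays the role of attention scores.
    return (L * (C @ B.T)) @ x

# Hypothetical toy sizes: T time steps, N state dimensions.
rng = np.random.default_rng(0)
T, N = 8, 4
a = rng.uniform(0.5, 1.0, size=T)   # per-step decay in (0, 1)
B = rng.normal(size=(T, N))         # input projections
C = rng.normal(size=(T, N))         # output projections
x = rng.normal(size=T)              # one input channel over time

y_rec = ssm_recurrent(a, B, C, x)
y_att = ssm_attention_like(a, B, C, x)
print(np.allclose(y_rec, y_att))
```

The sequential form costs time linear in the sequence length, while the matrix form is quadratic but maps onto dense matrix multiplication; having both forms available is what makes this family of models attractive when inference speed matters.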

Paper Structure

This paper contains 18 sections, 12 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Our method takes sparse observations as inputs and reconstructs full-body motion as output, outperforming the state-of-the-art methods.
  • Figure 2: Comparison of our approach with state-of-the-art methods in terms of overall performance. Our method achieves the smallest average position error and fast inference speed while maintaining a lightweight model architecture.
  • Figure 3: The overall framework of this work. (a) The framework of the proposed SSD-Poser model. (b) The framework of the State Space Attention Encoder (SSAE). (c) The framework of the Pose State Space Block (PSSB). (d) The framework of the Frequency-Aware Decoder (FAD). (e) The framework of the Frequency-Aware Feature Extractor (FAFE).
  • Figure 4: Visualization results of different actions compared with other state-of-the-art methods. Red regions indicate reconstruction errors; wider or deeper red areas represent larger differences.
  • Figure 5: Visualization results of continuous pose sequences compared with other methods. Dashed boxes are used to highlight the most significant differences.
  • ...and 1 more figure