KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

Shuting Zhao, Zeyu Xiao, Xinrong Chen

TL;DR

KineST presents a lightweight, kinematics-guided state-space model for full-body motion tracking from sparse HMD signals, combining a Temporal Flow Module with bidirectional SSD scanning and a Spatiotemporal Kinematic Flow Module that uses a Kinematic Tree Scanning Strategy and Spatiotemporal Mixing Mechanism. A geometric angular velocity loss on SO(3) further enforces physically meaningful rotational dynamics, improving motion continuity. Across AMASS-based protocols and real headset data, KineST achieves state-of-the-art accuracy and temporal coherence with a compact architecture, enabling real-time performance suitable for AR/VR avatars and kinesthetic interactions. The work highlights the value of integrating kinematic priors and end-to-end spatiotemporal coupling to close the accuracy-smoothness-efficiency gap in sparse-signal motion tracking.
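As a rough illustration of what a kinematic-tree scan order could look like, the sketch below turns a parent-index array into a root-to-leaf joint ordering and its reverse, which a bidirectional scan could then consume. The five-joint toy skeleton, the depth-first traversal, and the function name `kinematic_scan_order` are assumptions made for this example; the summary does not specify the traversal KineST actually uses.

```python
def kinematic_scan_order(parents):
    """Return a depth-first, root-to-leaf joint ordering.

    parents[j] is the parent index of joint j, or -1 for the root.
    Assumption: depth-first preorder; the paper's ordering may differ.
    """
    children = {joint: [] for joint in range(len(parents))}
    root = None
    for joint, parent in enumerate(parents):
        if parent < 0:
            root = joint
        else:
            children[parent].append(joint)

    order, stack = [], [root]
    while stack:
        joint = stack.pop()
        order.append(joint)
        # Push children in reverse so the lowest-index child is visited first.
        stack.extend(reversed(children[joint]))
    return order


# Hypothetical toy skeleton: 0 pelvis, 1 spine, 2 left hip, 3 head, 4 left knee.
toy_parents = [-1, 0, 0, 1, 2]
forward = kinematic_scan_order(toy_parents)   # root toward the leaves
backward = list(reversed(forward))            # leaves toward the root
print(forward, backward)
```

With such an ordering, each joint is scanned after its parent in the forward pass and before it in the backward pass, which is one simple way to expose parent-child structure to a sequence model.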

Abstract

Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: https://kaka-1314.github.io/KineST/
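To make the idea of an angular velocity loss on SO(3) concrete, here is a minimal sketch under stated assumptions: angular velocity is taken as the log map of the relative rotation between consecutive frames, and the loss is the mean Euclidean distance between predicted and ground-truth angular velocities. The exact formulation in the paper is not given in this summary, so the function names, the `dt` parameter, and the choice of norm are all hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def so3_log(r):
    """Log map: rotation matrix -> rotation vector (axis * angle).

    Sketch only: not numerically robust for angles close to pi.
    """
    cos_theta = max(-1.0, min(1.0, (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    v = [r[2][1] - r[1][2], r[0][2] - r[2][0], r[1][0] - r[0][1]]
    if theta < 1e-8:
        return [0.5 * x for x in v]  # first-order approximation near identity
    scale = theta / (2.0 * math.sin(theta))
    return [scale * x for x in v]


def angular_velocities(rotations, dt=1.0):
    """Angular velocity between consecutive frames: log(R_t^T R_{t+1}) / dt."""
    return [[w / dt for w in so3_log(matmul(transpose(a), b))]
            for a, b in zip(rotations, rotations[1:])]


def angular_velocity_loss(pred, target, dt=1.0):
    """Mean Euclidean distance between predicted and target angular velocities."""
    errors = [math.sqrt(sum((p - q) ** 2 for p, q in zip(wp, wt)))
              for wp, wt in zip(angular_velocities(pred, dt),
                                angular_velocities(target, dt))]
    return sum(errors) / len(errors)


def rot_z(angle):
    """Rotation about the z axis, used only to build the toy sequences."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


target = [rot_z(0.1 * t) for t in range(4)]  # joint turning 0.1 rad per frame
pred = [rot_z(0.2 * t) for t in range(4)]    # prediction turning twice as fast
print(angular_velocity_loss(pred, target))
```

Because the penalty is computed from relative rotations rather than from raw rotation parameters, it measures how differently two sequences rotate from frame to frame, which is the property the abstract associates with improved motion stability.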

Paper Structure

This paper contains 39 sections, 11 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of our approach with state-of-the-art methods in terms of overall performance. Our method achieves the smallest average position error and smoother motion, and maintains a lightweight model architecture.
  • Figure 2: Overall architecture. (a) The architecture of the proposed KineST model, whose main components are the temporal flow module (TFM) and spatiotemporal kinematic flow module (SKFM). (b) The shared structure of the flow module used in both TFM and SKFM, which comprises a bidirectional SSD block, a local motion aggregator (LMA), and a global motion aggregator (GMA). (c) Temporal modeling within the TFM. (d) Kinematics-guided spatiotemporal modeling within the SKFM.
  • Figure 3: Comparison of different scanning strategies.
  • Figure 4: Visualization results of different actions compared with other methods. The magnitude of each joint error is indicated by red shading, allowing a comparative assessment of reconstruction accuracy across various poses for each method. These visuals confirm the robustness and enhancements of the proposed model, particularly in lower-body predictions.
  • Figure 5: Visualization results of continuous pose sequences compared with other methods. The visualization illustrates that the proposed model delivers smoother and more realistic body motion tracking. Notably, the red dashed boxes highlight regions where the proposed model provides a more refined reconstruction.
  • ...and 1 more figure