SMamDiff: Spatial Mamba for Stochastic Human Motion Prediction

Junqiao Fan, Pengfei Liu, Haocong Rao

TL;DR

The paper tackles stochastic human motion prediction (HMP) by proposing SMamDiff, a single-stage diffusion model that enforces temporal and spatial coherence through a residual-DCT motion encoding and a stickman-drawing spatial module. By operating directly in the residual-DCT domain and scanning joints in an ordered, joint-by-joint manner with a state-space model, the method achieves state-of-the-art performance among single-stage HMP approaches while reducing latency and memory use. It also introduces a K-Diversity objective to promote multiple plausible futures without sacrificing realism. The approach demonstrates strong results on Human3.6M and HumanEva-I, highlighting practical impact for edge deployments in e-health and robotics.

Abstract

With intelligent room-side sensing and service robots widely deployed, human motion prediction (HMP) is essential for safe, proactive assistance. However, many existing HMP methods either produce a single, deterministic forecast that ignores uncertainty or rely on probabilistic models that sacrifice kinematic plausibility. Diffusion models improve the accuracy-diversity trade-off but often depend on multi-stage pipelines that are costly for edge deployment. This work focuses on how to ensure spatial-temporal coherence within a single-stage diffusion model for HMP. We introduce SMamDiff, a Spatial Mamba-based Diffusion model with two novel designs: (i) a residual-DCT motion encoding that subtracts the last observed pose before a temporal DCT, reducing the dominance of the DC component ($f=0$) and highlighting informative higher-frequency cues so the model learns how joints move rather than where they are; and (ii) a stickman-drawing spatial-mamba module that processes joints in an ordered, joint-by-joint manner, making later joints condition on earlier ones to induce long-range, cross-joint dependencies. On Human3.6M and HumanEva-I, these coherence mechanisms deliver state-of-the-art results among single-stage probabilistic HMP methods, with lower latency and memory use than multi-stage diffusion baselines.
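
The residual-DCT idea in (i) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names (`dct_basis`, `residual_dct`, `inverse_residual_dct`, `dc_energy_share`), the orthonormal DCT-II convention, and the toy single-joint motion are all assumptions made for illustration. The sketch only shows that subtracting the last observed pose before the temporal DCT shrinks the share of energy carried by the $f=0$ coefficient.

```python
import numpy as np


def dct_basis(n_frames: int) -> np.ndarray:
    """Orthonormal DCT-II basis; row f holds the cosine of frequency index f."""
    t = np.arange(n_frames)
    basis = np.sqrt(2.0 / n_frames) * np.cos(
        np.pi * (2 * t[None, :] + 1) * t[:, None] / (2 * n_frames)
    )
    basis[0] /= np.sqrt(2.0)  # rescale the DC row so the basis is orthonormal
    return basis


def residual_dct(motion: np.ndarray, last_observed: np.ndarray) -> np.ndarray:
    """Subtract the last observed pose, then apply a DCT along the time axis.

    motion: (T, D) time-domain coordinates; last_observed: (D,) reference pose.
    """
    return dct_basis(motion.shape[0]) @ (motion - last_observed)


def inverse_residual_dct(coeffs: np.ndarray, last_observed: np.ndarray) -> np.ndarray:
    """Invert the DCT (transpose of an orthonormal basis) and add the pose back."""
    return dct_basis(coeffs.shape[0]).T @ coeffs + last_observed


def dc_energy_share(coeffs: np.ndarray) -> float:
    """Fraction of total squared coefficient energy held by the f=0 row."""
    energy = coeffs ** 2
    return float(energy[0].sum() / energy.sum())


# Toy data (assumption): one joint far from the origin with a small oscillation.
n_frames, n_obs = 20, 5
t = np.arange(n_frames)
motion = 5.0 + 0.1 * np.sin(2 * np.pi * t / n_frames)[:, None] * np.ones((1, 3))
last_observed = motion[n_obs - 1]  # pose at the last observed frame

plain = dct_basis(n_frames) @ motion            # conventional DCT encoding
residual = residual_dct(motion, last_observed)  # residual-DCT encoding
recovered = inverse_residual_dct(residual, last_observed)

print("DC share, plain DCT:   ", dc_energy_share(plain))
print("DC share, residual DCT:", dc_energy_share(residual))
```

In this toy case the plain DCT spends nearly all of its energy on the static offset of the joint, whereas the residual encoding leaves a much larger fraction for the frequencies that describe the movement itself. The transform stays exactly invertible as long as the subtracted pose is kept.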

Paper Structure

This paper contains 16 sections, 11 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: System architecture of SMamDiff, the proposed diffusion-based HMP framework. Top-left: overall diffusion process operating in the residual-DCT domain. Bottom-left: network architecture incorporating the spatial Mamba design to mimic stickman drawing. Right: detailed Mam-Trans block used in the SMamDiff architecture.
  • Figure 2: Our proposed residual-DCT for converting time-domain (TD) human motion $X$ into frequency-domain (FD) human motion representation $\chi^r$. The traditional FD DCT motion representation $\chi^{org}$ is presented for comparison.
  • Figure 3: Left: “Stickman-drawing” scan starting from the head. The blue number on each joint shows its Joint Repeat (how many times that joint is scanned during the forward pass). Right: For each joint $v$, only the last visit is kept as its Joint Select feature; earlier visits are discarded. Backtracking at leaves lets later joints condition on earlier ones, creating long-range, cross-joint coherence across all $V$ joints in a single pass.
  • Figure 4: Qualitative comparisons by visualizing prediction results for 4 actions from the Human3.6M dataset. Each column depicts the temporal evolution of motion from $0.5$ s to $2.0$ s, with the ground-truth sequence shown at the top, followed by the multi-hypothesis predictions and a stacked visualization of all predicted trajectories.
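
The scan order described in the Figure 3 caption can be sketched as a depth-first traversal that re-emits a joint each time the "pen" backtracks to it. The sketch below is a guess at the idea, not the paper's exact rule: the toy skeleton, the joint names, the function `stickman_scan`, and the choice to return all the way to the root are hypothetical. It shows how a Joint Repeat count and a last-visit Joint Select index fall out of such a sequence.

```python
from collections import Counter

# Hypothetical toy skeleton (not the Human3.6M joint layout): parent -> children.
SKELETON = {
    "head": ["neck"],
    "neck": ["l_shoulder", "r_shoulder", "hip"],
    "l_shoulder": ["l_hand"],
    "r_shoulder": ["r_hand"],
    "hip": ["l_foot", "r_foot"],
}


def stickman_scan(tree: dict, root: str) -> list:
    """Depth-first scan that records a joint on entry and on every backtrack."""
    order = []

    def draw(joint: str) -> None:
        order.append(joint)
        for child in tree.get(joint, []):
            draw(child)
            order.append(joint)  # the pen returns to the parent after each limb

    draw(root)
    return order


order = stickman_scan(SKELETON, "head")

# Joint Repeat: how many times each joint appears in the scanned sequence.
joint_repeat = Counter(order)

# Joint Select: index of the LAST visit of each joint (later entries overwrite).
joint_select = {joint: index for index, joint in enumerate(order)}

print(order)
print(dict(joint_repeat))
print(joint_select)
```

Because a sequential model such as a state-space layer processes this list left to right, the feature read at a joint's last visit has already seen every joint scanned before it. This is one way to read the caption's statement that later joints condition on earlier ones; keeping only the last visit brings the sequence back to one feature per joint.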