A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals

Jiangnan Tang, Jingya Wang, Kaiyang Ji, Lan Xu, Jingyi Yu, Ye Shi

TL;DR

This work tackles the problem of estimating full-body human motion in 3D scenes from the sparse tracking signals typical of AR/VR devices. It introduces $\text{S}^2$Fusion, a conditional diffusion framework that fuses scene geometry, sparse upper-body signals, and periodic alignment features. A pre-trained motion prior initializes the reverse diffusion process, and loss-guided sampling regularizes the lower body, for which no tracking signals are available. The approach combines a VAE-based motion prior, a periodic autoencoder for time alignment, and two sampling-guidance losses (scene-penetration and phase-matching) to produce plausible, coherent motions that respect scene geometry. Empirical results on CIRCLE and GIMO show state-of-the-art accuracy and motion smoothness, and ablations confirm the contribution of each component and of loss-guided sampling.

Abstract

Estimating full-body human motion in 3D scenes from the sparse tracking signals of head-mounted displays and hand controllers is crucial to applications in AR/VR. One of the biggest challenges in this task is the one-to-many mapping from sparse observations to dense full-body motions, which introduces inherent ambiguity. To help resolve this ambiguity, we introduce a new framework that uses the rich contextual information provided by scenes to benefit full-body motion tracking from sparse observations. To estimate plausible human motions given sparse tracking signals and 3D scenes, we develop $\text{S}^2$Fusion, a unified framework fusing Scene and sparse Signals with a conditional difFusion model. $\text{S}^2$Fusion first extracts the spatial-temporal relations residing in the sparse signals via a periodic autoencoder, and then produces a time-alignment feature embedding as an additional input. Subsequently, by drawing the initial noisy motion from a pre-trained prior, $\text{S}^2$Fusion uses conditional diffusion to fuse scene geometry and sparse tracking signals and generate full-body scene-aware motions. The sampling procedure of $\text{S}^2$Fusion is further guided by a specially designed scene-penetration loss and phase-matching loss, which effectively regularize the motion of the lower body even in the absence of any tracking signals, making the generated motion much more plausible and coherent. Extensive experimental results demonstrate that $\text{S}^2$Fusion outperforms the state of the art in terms of estimation quality and smoothness.
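To make the notion of a periodic feature more concrete, the sketch below recovers the amplitude, frequency, and phase of the dominant sinusoid in a one-dimensional signal and compares the phase of two signals. It is only an illustration: the paper uses a learned periodic autoencoder, whereas this sketch substitutes a plain discrete Fourier transform, and the function name `dominant_sinusoid`, the 30 fps frame rate, and the synthetic "upper" and "lower" signals are all assumptions made for the example.

```python
import cmath
import math


def dominant_sinusoid(signal, fps):
    """Return (amplitude, frequency in Hz, phase in radians) of the strongest
    non-constant sinusoid in `signal`, using a direct DFT.

    Illustrative stand-in only; NOT the learned periodic autoencoder of the paper.
    The phase follows the convention signal[t] ~ amplitude * cos(2*pi*f*t/fps + phase).
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # drop the constant offset

    best_k, best_coeff = 1, 0j
    # Scan frequency bins strictly below the Nyquist bin.
    for k in range(1, (n + 1) // 2):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * t / n)
            for t, c in enumerate(centered)
        )
        if abs(coeff) > abs(best_coeff):
            best_k, best_coeff = k, coeff

    amplitude = 2.0 * abs(best_coeff) / n
    frequency = best_k * fps / n
    phase = cmath.phase(best_coeff)
    return amplitude, frequency, phase


# Synthetic example (assumed data): two 1 Hz signals sampled at 30 fps for 2 s,
# the second one shifted by a quarter period and with half the amplitude.
fps, n = 30, 60
upper = [1.0 * math.cos(2 * math.pi * t / fps) for t in range(n)]
lower = [0.5 * math.cos(2 * math.pi * t / fps + math.pi / 2) for t in range(n)]

amp_u, freq_u, phase_u = dominant_sinusoid(upper, fps)
amp_l, freq_l, phase_l = dominant_sinusoid(lower, fps)
phase_shift = phase_l - phase_u  # time alignment between the two signals
```

In this toy setting the phase difference plays the role of a time-alignment cue between two body parts, and the amplitude indicates how strong the periodic movement is.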

Paper Structure (27 sections, 22 equations, 10 figures, 7 tables, 1 algorithm)

Figures (10)

  • Figure 1: Given sparse tracking signals from only the head and left/right hands, our method accurately estimates full-body motion in the 3D scene.
  • Figure 2: Illustration of the $\text{S}^2$Fusion pipeline. Given the sparse tracking signals $\mathbf{p}^{1:N}$ and scene geometry $\mathcal{S}$, $\text{S}^2$Fusion generates full-body motion with scene awareness and coherent upper- and lower-body movements. (1) The pre-trained motion prior $f_\phi$ first samples the initial noisy motion $\tilde{\mathbf{x}}^{1:N}$ for the reverse diffusion process; (2) the periodic motion features $\mathbf{f}^{1:N}$ are then extracted by a periodic autoencoder and combined with the encoded scene feature $\mathbf{E}_{\mathcal{S}}$ and the sparse tracking signals $\mathbf{p}^{1:N}$ to form the final conditioning input $\mathbf{c}$ to the reverse diffusion process; (3) the conditional diffusion model predicts the clean motion $\mathbf{x}^{1:N}_0$ from the noisy motion $\tilde{\mathbf{x}}^{1:N}$ conditioned on $\mathbf{c}$; (4) the diffusion sampling process is further guided by the gradients of $\ell_{\text{penetration}}$ and $\ell_{\text{phase}}$ to generate scene-aware and physically plausible motions.
  • Figure 3: A visualization of the periodic motion features of the upper and lower body extracted from randomly selected motion sequences in AMASS (Mahmood et al., 2019). The phase shift of the sinusoidal functions indicates the time alignment of the upper- and lower-body motions, while the amplitude resembles the momentum. The periodic motion features of the upper body are visibly correlated with those of the lower body.
  • Figure 4: Qualitative results on the CIRCLE (Araujo et al., CVPR 2023) dataset. We show the results for two motion sequences in different scenes and highlight the implausible motions with red boxes. Our method generates better-correlated leg motions and avoids scene penetration as much as possible.
  • Figure 5: The structure of our VAE-based motion prior, which consists of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$.
  • ...and 5 more figures
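
Step (4) of the Figure 2 caption, guiding sampling with the gradient of a loss, can be sketched in a few lines. The example below is a toy illustration under strong assumptions, not the paper's implementation: the scene is reduced to a flat floor at height 0, the motion is reduced to a list of joint heights, the denoiser is replaced by an identity stand-in, and the names `penetration_loss`, `penetration_grad`, and `toy_guided_sampling` are hypothetical. It only shows how repeatedly stepping against the gradient of a penetration-style loss pushes penetrating joints back toward the surface while leaving other joints untouched.

```python
def penetration_loss(joint_heights, floor_z=0.0):
    """Sum of squared penetration depths of joints below an assumed flat floor."""
    return sum(max(0.0, floor_z - z) ** 2 for z in joint_heights)


def penetration_grad(joint_heights, floor_z=0.0):
    """Analytic gradient of penetration_loss with respect to each joint height."""
    return [-2.0 * max(0.0, floor_z - z) for z in joint_heights]


def toy_guided_sampling(x_init, num_steps=20, step_size=0.25,
                        denoise=lambda x: list(x)):
    """Toy guidance loop: predict a clean sample, then nudge it down the loss gradient.

    `denoise` is an identity stand-in for a conditional diffusion model; a real
    sampler would also follow a noise schedule, which this sketch omits.
    """
    x = list(x_init)
    for _ in range(num_steps):
        x0_pred = denoise(x)                # stand-in for the model prediction
        grad = penetration_grad(x0_pred)    # gradient of the guidance loss
        x = [z - step_size * g for z, g in zip(x0_pred, grad)]
    return x


# Assumed example: two joints start below the floor, one starts above it.
initial = [-0.4, 0.1, -0.05]
guided = toy_guided_sampling(initial)
loss_before = penetration_loss(initial)
loss_after = penetration_loss(guided)
```

With these settings each guidance step halves the depth of a penetrating joint, so the loss shrinks geometrically, and the joint that already sits above the floor is never moved because its gradient is zero.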