Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking

Wei Cao, Chang Luo, Biao Zhang, Matthias Nießner, Jiapeng Tang

TL;DR

Motion2VecSets addresses the ill-posed problem of reconstructing 4D non-rigid surfaces from sparse, noisy, and partial point clouds by learning a probabilistic 4D prior through diffusion over latent sets. It introduces a 4D neural representation with a shape latent set for the reference frame and deformation latent sets for frame-to-frame motion, combined with synchronized diffusion via Interleaved Spatio-Temporal Attention to enforce spatio-temporal coherence and efficiency. The method demonstrates superior 4D reconstruction and completion on D-FAUST and DT4D-A datasets, including unseen identities and motions, and shows robustness to partial observations. Potential extensions include multi-modal 4D generation and text-driven 4D synthesis.

Abstract

We introduce Motion2VecSets, a 4D diffusion model for dynamic surface reconstruction from point cloud sequences. While existing state-of-the-art methods have demonstrated success in reconstructing non-rigid objects using neural field representations, conventional feed-forward networks encounter challenges with ambiguous observations from noisy, partial, or sparse point clouds. To address these challenges, we introduce a diffusion model that explicitly learns the shape and motion distribution of non-rigid objects through an iterative denoising process of compressed latent representations. The diffusion-based priors enable more plausible and probabilistic reconstructions when handling ambiguous inputs. We parameterize 4D dynamics with latent sets instead of using global latent codes. This novel 4D representation allows us to learn local shape and deformation patterns, leading to more accurate non-linear motion capture and significantly improving generalizability to unseen motions and identities. For more temporally-coherent object tracking, we synchronously denoise deformation latent sets and exchange information across multiple frames. To avoid computational overhead, we designed an interleaved space and time attention block to alternately aggregate deformation latents along spatial and temporal domains. Extensive comparisons against state-of-the-art methods demonstrate the superiority of our Motion2VecSets in 4D reconstruction from various imperfect observations. More detailed information can be found at https://vveicao.github.io/projects/Motion2VecSets/.
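
The interleaved space and time attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes identity query/key/value projections (no learned weights), omits the conditioning cross-attention, and uses the hypothetical names `self_attention`, `space_then_time_attention`, and `attention_pair_counts`. It only shows how latents are grouped per frame and then per latent index, and why this costs fewer pairwise attention scores than attending over all frames and latents jointly.

```python
import math


def self_attention(tokens):
    """Softmax self-attention with identity projections (illustrative only)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, tokens)) / total for c in range(dim)]
        )
    return out


def space_then_time_attention(latents):
    """latents[f][i] is the i-th latent vector of frame f."""
    # Space attention: each frame attends over its own latents.
    spatial = [self_attention(frame) for frame in latents]
    num_frames, num_latents = len(spatial), len(spatial[0])
    # Time attention: latent index i attends over its copies in all frames.
    columns = [
        self_attention([spatial[f][i] for f in range(num_frames)])
        for i in range(num_latents)
    ]
    # Transpose back to the [frame][latent] layout.
    return [[columns[i][f] for i in range(num_latents)] for f in range(num_frames)]


def attention_pair_counts(num_frames, num_latents):
    """Pairwise score counts: joint attention vs. one space pass plus one time pass."""
    joint = (num_frames * num_latents) ** 2
    interleaved = num_frames * num_latents**2 + num_latents * num_frames**2
    return joint, interleaved
```

For example, with 4 frames and 8 latents per frame (arbitrary illustrative sizes), `attention_pair_counts(4, 8)` gives 1024 pairs for joint attention and 384 for the interleaved scheme; the gap widens as either count grows.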

Paper Structure

This paper contains 43 sections, 11 equations, 18 figures, and 10 tables.

Figures (18)

  • Figure 1: We present Motion2VecSets, a 4D diffusion model for dynamic surface reconstruction from sparse, noisy, or partial point cloud sequences. Compared to the existing state-of-the-art method CaDeX [Lei2022CaDeX], our method reconstructs more plausible non-rigid object surfaces with complicated structures and achieves more robust motion tracking.
  • Figure 2: Pipeline overview of Motion2VecSets. Given a sequence of sparse and noisy point clouds $\{\mathbf{P}^t\}_{t=1}^{T}$ as input, Motion2VecSets outputs a continuous mesh sequence $\{\mathcal{M}^t\}_{t=1}^{T}$. The initial input frame $\mathbf{P}^1$ (top left) is used as a condition in the Shape Vector Set Diffusion, yielding denoised shape codes $\mathcal{S}$ for reconstructing the geometry of the reference frame $\mathcal{M}^1$ (top right). Concurrently, the subsequent input frames $\{\mathbf{P}^t\}_{t=2}^{T}$ (bottom left) are used in the Synchronized Deformation Vector Sets Diffusion to produce denoised deformation codes $\{\mathcal{D}^t\}_{t=2}^{T}$, where each latent set $\mathcal{D}^t$ encodes the deformation from the reference frame $\mathcal{M}^1$ to the subsequent frame $\mathcal{M}^t$.
  • Figure 3: Deformation Autoencoder. Given a pair of point clouds $\mathbf{X}_{\text{src}}$ and $\mathbf{X}_{\text{tgt}}$ from two frames of a dynamic mesh sequence, we first downsample them using farthest point sampling (FPS). The concatenated points are then passed into a transformer encoder to generate the Deformation Latent Set $\mathcal{D}$. For a query point $\mathbf{q}$ in the source space, a cross-attention layer is used to match the most relevant fused feature $\mathbf{z}$. This selected feature is then fed into the deformation MLP decoder to predict an offset $\Delta\mathbf{q}$, translating the query point to $\mathbf{q'}$ in the target space. To reduce the feature diversity of $\mathcal{D}$, KL-regularization is employed.
  • Figure 4: Synchronized Deformation Vector Set Diffusion. Given noised deformation vector sets $\{\hat{\mathcal{D}}^t\}_{t=2}^{T}$ (top) from a sequence, each set denoted as $\hat{\mathcal{D}}^{t} = \{\hat{\mathbf{d}}_1^t,...,\hat{\mathbf{d}}_{M}^t \}$ at time step $t \in [2,T]$, we use repeated Interleaved Spatio-Temporal Attention (ISTA) blocks as our denoising network. In each ISTA block, we first pass the sets to the space self-attention layer (Space Attention), which aggregates the latent features $\hat{\mathcal{D}}^{t}$ across different spatial locations within each frame to explore spatial contexts. Next, we inject conditional information extracted from sparse or partial point clouds via cross-attention (Condition Attention) between conditional codes $\mathcal{C}^t$ and noised deformation codes $\hat{\mathcal{D}}^{t}$ at each frame. Lastly, to enhance temporal coherence, a time self-attention layer (Time Attention) aggregates latent codes from the same position but from different frames, i.e., $\{\hat{\mathbf{d}}_i^t\}_{t=2}^{T}$. Repeating this ISTA block yields the denoised deformation latent sets $\{\mathcal{D}^t\}_{t=2}^{T}$ (bottom). Within each layer, different colored latents represent the dynamics of distinct local regions, while the same colored latents represent the dynamics of a local region at different time steps.
  • Figure 5: Comparisons of 4D Shape Reconstruction from sparse and noisy point clouds on the D-FAUST [DFAUST] (left) and the DT4D-A [DeformingThings4D] (right) datasets. We visualize the Chamfer Distance between the reconstructions and the ground truth as error maps. Our method can reconstruct more accurate surface geometries and motion dynamics.
  • ...and 13 more figures
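
The farthest point sampling (FPS) step named in the Figure 3 caption can be sketched as below. This is the standard greedy algorithm written in plain Python, not code from the paper; the function name `farthest_point_sampling` and the `start_index` parameter are assumptions made for illustration, and the caption does not say how the first point is chosen.

```python
def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.

    points: sequence of equal-length coordinate tuples.
    Returns the indices of the selected points, in selection order.
    """
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [start_index]
    # min_sq_dist[j]: squared distance from point j to its nearest selected point.
    min_sq_dist = [sq_dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        # The next sample is the point farthest from the current selection.
        next_index = max(range(len(points)), key=min_sq_dist.__getitem__)
        selected.append(next_index)
        # Update nearest-selected distances with the newly chosen point.
        for j, p in enumerate(points):
            d = sq_dist(p, points[next_index])
            if d < min_sq_dist[j]:
                min_sq_dist[j] = d
    return selected
```

Each iteration scans all points once, so selecting k of n points costs O(nk) distance evaluations. The selected subset spreads over the input, which is why FPS is a common choice for downsampling point clouds before encoding.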