MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

Ruijie Zhu, Jiahao Lu, Wenbo Hu, Xiaoguang Han, Jianfei Cai, Ying Shan, Chuanxia Zheng

TL;DR

MotionCrafter is a diffusion-based framework that jointly reconstructs dense 4D geometry and dense scene motion from monocular video. It builds a unified 4D latent by coupling a Geometry VAE with a Motion VAE, which enables end-to-end feed-forward reconstruction without post-optimization, and it reports improvements over prior state-of-the-art methods of 38.64% in geometry reconstruction and 25.0% in motion reconstruction, with scene flow expressed in world coordinates. A key finding is that the 4D latents do not need to align strictly with the RGB VAE latents of the pre-trained video diffusion model; a relaxed normalization and two-stage VAE training suffice to leverage pre-trained video priors effectively. The approach yields robust, temporally coherent 4D reconstructions on diverse datasets, with practical implications for video understanding, robotics, and world-model learning, and it points to multi-modal extensions as future work.

Abstract

We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D values and latents to align strictly with RGB VAE latents, despite their fundamentally different distributions, we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
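The joint representation described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes point maps and scene flows are stored as arrays of shape `(T, H, W, 3)` in one shared world coordinate system, and that adding a pixel's flow to its 3D point gives the moved point. The helper names `displaced_points` and `normalize_by_mean_scale` are hypothetical, and the normalization shown (dividing by the mean point distance from the origin) is only one plausible choice, since the abstract does not spell out the exact scheme.

```python
import numpy as np

def displaced_points(point_map, scene_flow):
    """Return the moved points X^d = X + V for every pixel of every frame."""
    return point_map + scene_flow

def normalize_by_mean_scale(point_map, scene_flow, eps=1e-8):
    """Hypothetical normalization (an assumption, not the paper's formula).

    Divides points and flow by the same scalar, the mean distance of all
    points from the origin, so both stay in one shared coordinate system.
    """
    scale = np.linalg.norm(point_map, axis=-1).mean() + eps
    return point_map / scale, scene_flow / scale, scale

# Toy "video": T=2 frames of 4x4 pixels, one xyz triple per pixel.
rng = np.random.default_rng(0)
points = rng.uniform(1.0, 50.0, size=(2, 4, 4, 3))  # dense 3D point maps
flow = rng.normal(0.0, 0.1, size=(2, 4, 4, 3))      # dense 3D scene flow

points_n, flow_n, scale = normalize_by_mean_scale(points, flow)
moved_n = displaced_points(points_n, flow_n)

# Undoing the scale recovers the moved points in the original units.
moved = moved_n * scale
```

Because a single scalar rescales both arrays, the relation between a point, its flow, and its moved position is preserved after normalization, which is the property a shared coordinate system needs.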

Paper Structure (57 sections, 26 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: MotionCrafter is a video diffusion-based framework for joint dense geometry and motion reconstruction. Given a monocular video as input, MotionCrafter simultaneously predicts a dense point map and scene flow for each frame within a shared world coordinate system. It outperforms optimization-based alternatives without requiring any post-optimization.
  • Figure 2: Overview of MotionCrafter. We first train a novel 4D VAE (bottom-right), consisting of a Geometry VAE and a Motion VAE. These two components jointly encode the point map and scene flow into a unified 4D latent representation. Within the Diffusion Unet, we use the pretrained VAE from SVD (Stable Video Diffusion) to encode the input video into latents that serve as conditional inputs, which are then channel-wise concatenated with our 4D latent to guide the denoising process. For the Diffusion version, we add noise only to the 4D latents during training. Note that we do not force the 4D latent distribution to align strictly with the original SVD VAE latent distribution; we find that this relaxed training strategy consistently improves the generalization performance of both the VAE and the Diffusion Unet.
  • Figure 3: Geometry and Motion representation. For a pixel $p_i$ in frame $\bm{I}_i$, $\bm{X}_i$ is its corresponding 3D point. As this 3D point moves, we denote the moved point by $\bm{X}_i^d$ and the motion by $\bm{V}_{i} = (\Delta x, \Delta y, \Delta z)$. Ideally, $\bm{X}_i^d$ should align with a matching point $\bm{X}_{i+1}$ in the next frame $\bm{I}_{i+1}$. However, the two points have different pixel indices ($p_i$ vs. $p_{i+1}$), and $p_{i+1}$ might even be out of view due to camera or object motion, making it impossible to build a one-to-one correspondence between $\bm{X}_i^d$ and $\bm{X}_{i+1}$.
  • Figure 4: Results of different normalization and VAE training strategies. For outdoor scenes with significant variations in depth (the second row), the original VAE fails to recover the scene structure. Even with decoder fine-tuning, the reconstruction quality remains poor. Our proposed mean normalization and VAE training strategy significantly improve reconstruction quality.
  • Figure 5: Qualitative comparison with Zero-MSF [liang2025zero]. Zoom in for details. Compared to Zero-MSF, our method recovers a more plausible scene structure and finer geometric details. More importantly, the direction of our predicted 3D scene flow is more accurate.
  • ...and 6 more figures
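
The conditioning scheme described in the Figure 2 caption (video latents concatenated channel-wise with the 4D latent, with noise added only to the 4D latent) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the function name `build_denoiser_input`, the latent layout `(T, C, h, w)`, the channel counts (4 for the video latent, 8 for the 4D latent), the concatenation order, and the simple additive noise with scale `sigma` are all hypothetical choices made for the example.

```python
import numpy as np

def build_denoiser_input(video_latent, latent_4d, sigma, rng):
    """Noise only the 4D latent, then concatenate along the channel axis.

    video_latent: (T, C_v, h, w) clean conditioning latent from the video VAE.
    latent_4d:    (T, C_4d, h, w) target latent from the 4D VAE.
    sigma:        noise scale (additive Gaussian noise is an assumption here).
    """
    noise = rng.standard_normal(latent_4d.shape)
    noisy_4d = latent_4d + sigma * noise  # the video latent stays noise-free
    unet_input = np.concatenate([noisy_4d, video_latent], axis=1)
    return unet_input, noise

rng = np.random.default_rng(0)
T, h, w = 3, 8, 8
video_latent = rng.standard_normal((T, 4, h, w))  # assumed 4 channels
latent_4d = rng.standard_normal((T, 8, h, w))     # assumed 8 channels

unet_input, noise = build_denoiser_input(video_latent, latent_4d, sigma=0.5, rng=rng)
print(unet_input.shape)  # (3, 12, 8, 8)
```

Keeping the video latent clean means the denoiser always sees an uncorrupted view of the input frames, so only the geometry and motion channels need to be recovered at each denoising step.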