Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu

TL;DR

This work tackles the challenge of generating dynamic 4D scenes from a single image by proposing a tightly coupled reconstruction-and-generation framework, MoRe4D. It introduces TrajScene-60K, a large-scale dataset of dense 4D trajectories, and a diffusion-based 4D Scene Trajectory Generator (4D-STraG) that jointly recovers geometry and motion. A Motion Perception Module with depth-guided normalization and a MAdaNorm conditioning scheme guide the diffusion model, while the 4D View Synthesis Module (4D-ViSM) renders high-fidelity videos from novel viewpoints. Empirical results show superior 4D consistency and visual quality compared with strong baselines, supported by extensive ablations and runtime analysis, pointing toward practical, single-image-to-4D content creation.

Abstract

Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these issues, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: https://github.com/Zhangyr2022/MoRe4D.
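
To make the phrase "render videos with arbitrary camera trajectories from 4D point track representations" concrete, the sketch below shows only the standard geometric step that any such renderer builds on: projecting a set of 3D point tracks into each frame of a user-chosen camera path with a pinhole model. It is not the 4D-ViSM module and is not taken from the MoRe4D code. The function name `project_tracks`, the `(T, N, 3)` track layout, and the per-frame world-to-camera matrices are all assumptions made for illustration.

```python
import numpy as np


def project_tracks(tracks, K, world_to_cam):
    """Project 4D point tracks into per-frame pixel coordinates.

    Hypothetical helper for illustration, not the authors' implementation.

    tracks:       (T, N, 3) world-space positions of N points over T frames
                  (assumed layout).
    K:            (3, 3) pinhole intrinsics with last row [0, 0, 1].
    world_to_cam: (T, 4, 4) one rigid world-to-camera transform per frame,
                  i.e. the camera trajectory.
    Returns (uv, valid): (T, N, 2) pixel coordinates and a (T, N) boolean
    mask marking points that lie in front of the camera.
    """
    T, N, _ = tracks.shape
    # Homogeneous coordinates so each 4x4 pose applies in one product.
    homog = np.concatenate([tracks, np.ones((T, N, 1))], axis=-1)
    cam = np.einsum("tij,tnj->tni", world_to_cam, homog)[..., :3]

    z = cam[..., 2]
    valid = z > 1e-6                    # points behind the camera are masked
    safe_z = np.where(valid, z, 1.0)    # avoid division by zero

    pix = np.einsum("ij,tnj->tni", K, cam)
    uv = pix[..., :2] / safe_z[..., None]
    return uv, valid


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 50.0],
                  [0.0, 100.0, 40.0],
                  [0.0, 0.0, 1.0]])
    # One point that moves one unit along +x between two frames.
    tracks = np.array([[[0.0, 0.0, 2.0]],
                       [[1.0, 0.0, 2.0]]])
    static_cam = np.stack([np.eye(4), np.eye(4)])
    uv, valid = project_tracks(tracks, K, static_cam)
    print(uv[:, 0], valid[:, 0])
```

With a static camera the moving point shifts in the image; if the second-frame pose instead translates by -1 along x, the same point stays at the same pixel. This is the sense in which scene motion (the tracks) and camera motion (the poses) are separate inputs to the view synthesis step.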

Paper Structure

This paper contains 35 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: MoRe4D for 4D synthesis from a single image. Most existing paradigms either suffer from geometric inconsistencies (generate-then-reconstruct) or are constrained by animating a pre-determined static geometry (vanilla reconstruct-then-generate). Our MoRe4D advances by tightly coupling geometric modeling and motion generation, effectively achieving consistent 4D motion and geometry.
  • Figure 2: TrajScene-60K curation pipeline. We curate videos from WebVid-10M, filtered via VLMs for structured motion and countable entities. Dense 4D point tracks are extracted and refined via depth filtering and Gaussian Splatting, producing 60K high-quality 4D scenes.
  • Figure 3: Pipeline of MoRe4D. Top: The 4D Scene Trajectory Generator (Sec. \ref{sec:4D-STraG}), a Diffusion Transformer, jointly generates geometry and motion. Bottom-Left: The Motion Perception Module (MPM) identifies potential motion regions and semantic structure from the input image. Bottom-Right: The 4D View Synthesis Module (Sec. \ref{sec:4D-ViSM}) renders the output into novel-view videos.
  • Figure 4: Qualitative results of our model. The first row shows the 4D point cloud generated by our 4D-STraG. The second and third rows show the videos rendered by our 4D-ViSM under two distinct, user-defined camera trajectories.
  • Figure 5: Qualitative comparison with baseline methods. For each sample, the first row shows the baseline results while the second row presents our MoRe4D results. The first column displays the input image and text prompt.
  • ...and 5 more figures