Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation

Xiaoyan Liu, Kangrui Li, Yuehao Song, Jiaxin Liu

TL;DR

Dream4D generates spatiotemporally coherent 4D scenes from a single image by combining semantic camera-pose planning with geometry-guided reconstruction. Its three-stage pipeline (VLM-based pose trajectory planning, pose-conditioned video diffusion, and pose-aware 4D reconstruction) is designed to keep both camera pose and temporal dynamics consistent over long horizons. The reported results show strong pose accuracy and temporal fidelity, with performance close to the state of the art among online methods and reconstruction quality competitive with optimization-based baselines. By unifying semantic understanding, temporal dynamics, and geometric constraints, the framework targets robust, controllable dynamic scene synthesis.

Abstract

The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach uses a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process. The generated sequences are finally converted into a persistent 4D representation. This framework is the first to leverage both the rich temporal priors of video diffusion models and the geometric awareness of reconstruction models, which significantly facilitates 4D generation and yields higher quality (e.g., mPSNR, mSSIM) than existing methods.
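
The abstract names mPSNR as a quality metric without defining it here. The sketch below assumes that mPSNR denotes masked PSNR, i.e., PSNR computed only over pixels selected by a validity mask, as is common in dynamic view synthesis evaluation. That reading is an assumption, not something this section states, and the function name `masked_psnr`, the flat-list image layout, and the `max_val` parameter are illustrative choices.

```python
import math
from typing import List


def masked_psnr(pred: List[float], target: List[float],
                mask: List[bool], max_val: float = 1.0) -> float:
    """PSNR restricted to pixels where mask is True.

    Assumption: 'mPSNR' means masked PSNR; images are flat lists of
    intensities in [0, max_val]. This is a sketch, not the paper's code.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have the same length")
    # Squared errors of the pixels the mask keeps.
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not errors:
        raise ValueError("mask selects no pixels")
    mse = sum(errors) / len(errors)
    if mse == 0.0:
        return math.inf  # identical on every masked pixel
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy example: the last two pixels are badly wrong but masked out.
pred = [0.5, 0.5, 0.0, 1.0]
target = [0.4, 0.6, 1.0, 0.0]
mask = [True, True, False, False]
print(masked_psnr(pred, target, mask))
print(masked_psnr(pred, target, [True] * 4))
```

Masking matters because pixels that are never observed from the evaluation view would otherwise dominate the error, so the masked score on the toy example is much higher than the unmasked one.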

Paper Structure

This paper contains 32 sections, 10 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Spatiotemporally consistent 4D generation. Our method achieves excellent consistency in both pose and temporal dimensions.
  • Figure 2: Pipeline Overview. Our method consists of three stages: a vision-language model (VLM), a video diffusion model (VD), and video-to-4D generation. Given the input images and a text prompt containing the scene information and the camera-control requirements, the VLM translates them into a conditional representation that describes the camera pose sequence and guides the subsequent generation process. Camera poses are categorized into translation, rotation, and stationary for fine-grained control. The VLM-predicted camera pose trajectory is then encoded into a trajectory condition and fed into a video diffusion model (VD) based on the Diffusion Transformer (DiT) [peebles2023scalablediffusionmodelstransformers]. The generated videos are passed to a 4D generator, which transforms them into the final 4D output.
  • Figure 3: Qualitative Evaluation of Pose Consistency. This visualization compares the structural integrity and positional coherence of a target object across sequential frames under varying camera poses. The red bounding boxes highlight the motion states of a target object across pose changes.
  • Figure 4: Qualitative Evaluation of Temporal Consistency. This visualization compares the smoothness and coherence of object motion across sequential frames under fixed camera poses. The red bounding boxes track the dynamic states of a target object, emphasizing its motion continuity over time.
  • Figure 5: Qualitative Results. We compare our method with the concurrent works Shape-of-Motion [wang2024shape], Cut3R [cut3r], and MegaSAM [li2024_megasam]. Our method achieves the best qualitative results.
  • ...and 2 more figures
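
The Figure 2 caption says camera poses are categorized into translation, rotation, and stationary before being encoded as a trajectory condition. As a rough illustration of what such a categorized plan could expand into, the sketch below turns a list of motion segments into per-frame 4x4 camera-to-world matrices. The `Segment` fields, the units, the axis convention, and the choice to apply each step in the camera's local frame are all assumptions made for this example; the paper's actual pose encoding is not described in this section.

```python
import math
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous camera-to-world transform


@dataclass
class Segment:
    """One hypothetical motion primitive in a planned camera path."""
    kind: str      # "translation", "rotation", or "stationary"
    axis: str      # "x", "y", or "z"; ignored when kind == "stationary"
    amount: float  # total distance (scene units) or total angle (degrees)
    frames: int    # number of frames the segment spans (must be > 0)


def identity() -> Matrix:
    return [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def step_matrix(seg: Segment) -> Matrix:
    """Per-frame relative transform for one segment."""
    m = identity()
    if seg.kind == "stationary":
        return m
    idx = "xyz".index(seg.axis)
    per_frame = seg.amount / seg.frames
    if seg.kind == "translation":
        m[idx][3] = per_frame
    elif seg.kind == "rotation":
        angle = math.radians(per_frame)
        c, s = math.cos(angle), math.sin(angle)
        # Indices of the two axes that span the plane of rotation.
        i, j = [(1, 2), (2, 0), (0, 1)][idx]
        m[i][i], m[i][j], m[j][i], m[j][j] = c, -s, s, c
    else:
        raise ValueError(f"unknown segment kind: {seg.kind!r}")
    return m


def build_trajectory(segments: List[Segment]) -> List[Matrix]:
    """Expand segments into one pose per frame, starting at identity."""
    poses = [identity()]
    for seg in segments:
        if seg.frames <= 0:
            raise ValueError("each segment must span at least one frame")
        step = step_matrix(seg)
        for _ in range(seg.frames):
            # Right-multiplying applies the step in the camera's local frame.
            poses.append(matmul(poses[-1], step))
    return poses


plan = [
    Segment("translation", "z", 1.0, 4),  # move along local z
    Segment("stationary", "z", 0.0, 2),   # hold still
    Segment("rotation", "y", 90.0, 3),    # turn 90 degrees about local y
]
poses = build_trajectory(plan)
print(len(poses))  # 10 poses: the start pose plus 4 + 2 + 3 frames
```

Keeping the plan as a short list of typed segments, rather than raw matrices, mirrors the caption's idea of fine-grained control: each segment can be edited or reordered independently before the per-frame poses are generated.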