Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
TL;DR
Free4D generates coherent 4D scenes from a single image without tuning by distilling priors from pretrained models into a spatial-temporal representation. The method animates the input into a reference video, initializes 4D geometry with MonST3R, and then uses a point-conditioned diffusion model guided by that geometry to produce spatially and temporally consistent multi-view videos, aided by adaptive classifier-free guidance (CFG), point-guided denoising, and temporal latent replacement. A modulation-based refinement then lifts the results into a coherent 4D Gaussian Splatting representation through coarse-to-fine optimization with targeted losses, enabling real-time free-view rendering. Ablation and cross-domain experiments on text-to-4D and image-to-4D tasks show performance competitive with state-of-the-art baselines and highlight the contribution of each component. Overall, Free4D offers an efficient, tuning-free solution for scene-level 4D generation from a single image, broadening practical access to dynamic 4D content creation.
Abstract
We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models into a consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models, followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatially and temporally consistent multi-view videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into a consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.
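The adaptive guidance mechanism mentioned above builds on classifier-free guidance, which can be illustrated with a minimal NumPy sketch. The combination rule in `cfg_combine` is the standard classifier-free guidance formula; everything else is an assumption made for illustration only. In particular, the abstract does not say how Free4D adapts the guidance scale, so the helper `adaptive_scale_map`, its visibility mask, and the two scale values are hypothetical and are not the authors' method.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: start from the unconditional noise
    prediction and move toward the conditional one by a factor of `scale`.
    `scale` may be a scalar or an array broadcastable to the predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def adaptive_scale_map(visible_mask, scale_visible=1.0, scale_hidden=3.0):
    """HYPOTHETICAL helper: assign one guidance scale to pixels flagged in
    `visible_mask` and another to the remaining pixels. Both the mask and
    the default values are assumptions, not taken from the paper."""
    return np.where(visible_mask, scale_visible, scale_hidden)

# Toy 2x2 "noise predictions" standing in for diffusion model outputs.
eps_uncond = np.zeros((2, 2))
eps_cond = np.ones((2, 2))
visible_mask = np.array([[True, False], [False, True]])

scale = adaptive_scale_map(visible_mask)
guided = cfg_combine(eps_uncond, eps_cond, scale)
```

A spatially varying scale of this kind lets the guidance strength differ between regions of a frame, in contrast to the single global scalar used in plain classifier-free guidance.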
