Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu

TL;DR

Free4D generates coherent 4D scenes from a single image without any tuning, by distilling priors from pretrained models into a spatial-temporal representation. The method animates the input into a reference video, initializes a 4D geometric structure with MonST3R, and then uses a point-conditioned diffusion model guided by that geometry to produce spatially and temporally consistent multi-view videos, aided by adaptive classifier-free guidance (CFG), point cloud guided denoising, and reference latent replacement. A modulation-based refinement lifts the results into a coherent 4D Gaussian Splatting representation via coarse-to-fine optimization with targeted losses, enabling real-time free-view rendering. Ablation and cross-domain experiments on text-to-4D and image-to-4D tasks demonstrate competitive performance against state-of-the-art baselines and highlight the contribution of each component. Overall, Free4D offers an efficient, tuning-free solution for scene-level 4D generation from a single image.

Abstract

We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.
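The adaptive guidance mechanism mentioned in the abstract builds on classifier-free guidance (CFG). This page does not state the exact adaptive rule, so the following is only a minimal sketch of standard CFG, extended with a hypothetical per-element guidance weight to illustrate what "adaptive" could mean. The function name `cfg_combine` and the idea of passing a list of weights are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Union


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: Union[float, Sequence[float]],
) -> List[float]:
    """Standard CFG: eps = eps_uncond + w * (eps_cond - eps_uncond).

    `scale` is either one global weight w or (hypothetical "adaptive"
    variant) one weight per element.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions must have the same length")
    if isinstance(scale, (int, float)):
        weights = [float(scale)] * len(eps_uncond)
    else:
        weights = [float(w) for w in scale]
        if len(weights) != len(eps_uncond):
            raise ValueError("per-element scale must match prediction length")
    return [u + w * (c - u) for u, c, w in zip(eps_uncond, eps_cond, weights)]


# One global guidance weight applied everywhere.
uniform = cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0)
# Assumed adaptive variant: a different weight for each element.
adaptive = cfg_combine([0.0, 1.0], [1.0, 3.0], [1.0, 0.0])
```

With a weight of 0 an element falls back to the unconditional prediction, and with a weight of 1 it equals the conditional one, so a spatially varying weight lets guidance strength differ across regions.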

Paper Structure

This paper contains 17 sections, 17 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Free4D can generate diverse 4D scenes from single-image or textual input. By enforcing spatial-temporal consistency in a tuning-free way, Free4D enables high-quality scene generation with explicit 4D controls.
  • Figure 2: Overview of Free4D. Given an input image or text prompt, we first generate a dynamic video $\mathcal{V}=\{I(t,1)\}_{t=1}^{T}$ using an off-the-shelf video generation model (Kling). Then, we employ MonST3R with a progressive static point cloud aggregation strategy for dynamic reconstruction, obtaining a 4D geometric structure. Next, guided by this structure, we render a coarse multi-view video $\mathcal{S}^{\prime}=\{\{I^{\prime}(t,k)\}_{t=1}^{T}\}_{k=1}^{K}$ along a predefined camera trajectory and refine it into $\mathcal{S}=\{\{I(t,k)\}_{t=1}^{T}\}_{k=1}^{K}$ using ViewCrafter. To ensure spatial-temporal consistency, we introduce Adaptive Classifier-Free Guidance (CFG) and Point Cloud Guided Denoising for spatial coherence, along with Reference Latent Replacement for temporal coherence. Finally, we propose an efficient training strategy with a Modulation-Based Refinement to lift the generated multi-view video $\mathcal{S}$ into a consistent 4D representation $\mathcal{R}$.
  • Figure 3: Qualitative comparisons of image-to-4D. We present the results using the same single-image prompts.
  • Figure 4: Qualitative comparisons of text-to-4D. We show the results based on the same text prompts.
  • Figure 5: Qualitative Comparison of Ablation Studies.
  • ...and 3 more figures
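The Figure 2 caption indexes every frame as $I(t,k)$, with time step $t=1,\dots,T$ and view $k=1,\dots,K$; the reference video $\mathcal{V}$ is the $k=1$ slice of the multi-view video $\mathcal{S}$. As a rough illustration of that layout only, the sketch below stores string labels in place of images. The helper names (`make_grid`, `view_video`, `time_slice`) are hypothetical and do not come from the paper.

```python
from typing import List


def make_grid(num_frames: int, num_views: int) -> List[List[str]]:
    """Build S as grid[k-1][t-1], holding a label for frame I(t, k)."""
    return [
        [f"I({t},{k})" for t in range(1, num_frames + 1)]
        for k in range(1, num_views + 1)
    ]


def view_video(grid: List[List[str]], k: int) -> List[str]:
    """All T frames of view k (1-based): one video along the time axis."""
    return list(grid[k - 1])


def time_slice(grid: List[List[str]], t: int) -> List[str]:
    """All K views at time t (1-based): one instant seen from every camera."""
    return [video[t - 1] for video in grid]


S = make_grid(num_frames=3, num_views=2)  # T = 3, K = 2
V = view_video(S, 1)                      # reference video, k = 1
views_at_t2 = time_slice(S, 2)            # every view at t = 2
```

In this picture, temporal coherence concerns frames within one `view_video` result, while spatial consistency concerns frames within one `time_slice` result.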