PaintScene4D: Consistent 4D Scene Generation from Text Prompts

Vinayak Gupta, Yunze Man, Yu-Xiong Wang

TL;DR

PaintScene4D generates photorealistic dynamic 4D scenes from text prompts with a training-free pipeline that uses video diffusion priors to bootstrap a multi-view 4D representation. The method introduces a Progressive Warping Module (PWM) and a Consistent Inpainting Module (CIM) to achieve spatial and temporal coherence, followed by a 4D Gaussian renderer that synthesizes novel views along user-defined trajectories. It outperforms baselines on CLIP-based metrics and in human preference evaluations, runs in around 2–3 hours on a single A100, offers explicit camera control, and generalizes to real-world videos. This work reduces computation, enhances camera control, and enables high-fidelity scene-level 4D content from text prompts, with potential applications in immersive content creation.

Abstract

Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at https://paintscene4d.github.io/
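
The warping step described in the abstract can be illustrated with a minimal depth-based forward warp. This is only a sketch under stated assumptions, not the authors' implementation: the function name `forward_warp`, the pinhole camera model, and the nearest-pixel splatting are all hypothetical choices made for illustration. It reprojects one frame into a nearby camera using its depth map and returns a mask of the holes that an inpainting model would then need to fill.

```python
import numpy as np

def forward_warp(image, depth, K, T_src_to_tgt):
    """Warp `image` (H, W, C) into a target camera using per-pixel `depth` (H, W).

    K is a 3x3 pinhole intrinsics matrix (assumed shared by both cameras).
    T_src_to_tgt is a 4x4 rigid transform from source-camera to
    target-camera coordinates.
    Returns the warped image and a boolean mask of holes left for inpainting.
    """
    H, W = depth.shape
    # Homogeneous pixel grid, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)

    # Unproject pixels to 3D points in the source camera frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()

    # Move the points into the target camera frame.
    pts_tgt = T_src_to_tgt[:3, :3] @ pts_src + T_src_to_tgt[:3, 3:4]
    z = pts_tgt[2]

    # Project and round to the nearest target pixel.
    z_safe = np.where(z > 1e-6, z, 1.0)
    proj = K @ pts_tgt
    ut = np.rint(proj[0] / z_safe).astype(int)
    vt = np.rint(proj[1] / z_safe).astype(int)
    valid = (z > 1e-6) & (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)

    # Write far points first so nearer points overwrite them. This simple
    # z-buffer relies on NumPy keeping the last write for repeated indices.
    order = np.argsort(-z[valid])
    colors = image.reshape(-1, image.shape[-1])[valid][order]
    ut, vt = ut[valid][order], vt[valid][order]

    warped = np.zeros_like(image)
    filled = np.zeros((H, W), dtype=bool)
    warped[vt, ut] = colors
    filled[vt, ut] = True
    return warped, ~filled
```

With an identity transform the frame is returned unchanged and the hole mask is empty. Shifting the camera exposes a band of unseen pixels along one image border, which is the kind of region the inpainting stage is meant to complete.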

Paper Structure

This paper contains 21 sections, 2 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: 4D Text-to-Scene Generation. Unlike prior methods that restrict text-to-4D generation to object-level reconstruction, or text-to-video models that lack explicit camera control, our approach reconstructs full, realistic 4D scenes that can be viewed from different trajectories, achieved via an efficient training-free architecture.
  • Figure 2: Method Overview. Our approach consists of three stages. First, we initialize the 4D scene using a diffusion prior to establish scene content and motion, estimate depth maps for each video frame, and initialize camera trajectory (extrinsics) and intrinsics for subsequent warping. In the second stage, we perform sequential warping and inpainting from the first timestamp. To ensure spatial and temporal coherence, our consistent inpainting module mitigates artifacts and aligns depth maps, preventing error accumulation. Finally, the generated view-time matrix is used to render novel views along user-defined camera trajectories, allowing for explicit camera control.
  • Figure 3: Gallery of Results. PaintScene4D successfully generates 4D scenes that maintain view- and temporal-coherence. The horizontal axis represents the time; the vertical axis represents different viewpoints. More visualizations are provided in the supplementary materials.
  • Figure 4: Comparisons with state-of-the-art text-to-4D generation methods. While both baseline methods produce scenes that broadly align with the text prompts, they lack essential fine details. Specifically, 4D-fy shows minimal motion and limited detail, whereas Dream-in-4D captures dynamics more effectively but produces stylized, cartoon-like renderings. In contrast, our method synthesizes photorealistic 4D scenes that faithfully follow the input text prompt while presenting significant, realistic dynamics within the scene.
  • Figure 5: Comparison against 4Real (Yu et al., 2024). We demonstrate that our method produces more dynamics, larger scene coverage, better video-text alignment, and more realistic scenes overall.
  • ...and 8 more figures