Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan, Chenguo Lin, Jingjing Zhao, Chenxin Li, Yuchen Lin, Haopeng Li, Honglei Yan, Kairun Wen, Yunlong Lin, Yixuan Yuan, Yadong Mu

TL;DR

Diff4Splat addresses the challenge of single-image controllable 4D scene generation by unifying video diffusion priors with a holistic explicit 4D representation. It introduces a Latent Dynamic Reconstruction Model (LDRM) that maps latent video tokens, conditioned on a camera trajectory, into a deformable 3D Gaussian field whose per-frame deformations encode motion. A multi-term training objective (Flow Matching, Photometric, Geometric, and Motion losses) plus a three-stage progressive scheme yields high-fidelity, temporally coherent 4D scenes with real-time rendering, all without test-time optimization. To support learning, the authors construct a large-scale 4D dataset with metric-depth annotations from synthetic and real-world videos, enabling robust geometry and motion priors. Overall, Diff4Splat delivers camera-controllable 4D content with competitive quality and significantly faster reconstruction (≈30 seconds) than optimization-based methods, with strong potential for XR, robotics, and simulation applications.
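As a rough illustration of how a multi-term objective like the one above is typically combined, the sketch below sums the four named loss terms with scalar weights. The weight values, the `LossWeights` class, and the `total_loss` function are hypothetical names introduced for illustration; the summary does not state the actual weights or how they change across the three training stages.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class LossWeights:
    """Hypothetical scalar weights; the real values are not given in the summary."""
    flow_matching: float = 1.0
    photometric: float = 1.0
    geometric: float = 1.0
    motion: float = 1.0


def total_loss(terms: Mapping[str, float], weights: LossWeights) -> float:
    """Weighted sum of the four loss terms named in the TL;DR.

    `terms` maps each term name to its already-computed scalar value.
    """
    return (
        weights.flow_matching * terms["flow_matching"]
        + weights.photometric * terms["photometric"]
        + weights.geometric * terms["geometric"]
        + weights.motion * terms["motion"]
    )


# Example with made-up loss values.
example_terms = {
    "flow_matching": 0.5,
    "photometric": 0.2,
    "geometric": 0.1,
    "motion": 0.05,
}
example_total = total_loss(example_terms, LossWeights())
```

A progressive scheme could, in principle, be expressed by passing a different `LossWeights` instance per stage, but which terms are active in which stage is not specified here.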

Abstract

We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.
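To make the idea of a deformable 3D Gaussian field concrete, here is a minimal sketch that stores one canonical center per Gaussian plus a per-frame displacement, and returns the deformed centers for a given frame. It is an assumption-laden simplification: the class name `DeformableGaussianField`, its fields, and the translation-only deformation are hypothetical, and the paper's actual parameterization (for example, how rotation, scale, opacity, or color vary over time) is not described in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class DeformableGaussianField:
    """Hypothetical container: canonical Gaussian centers plus per-frame offsets."""
    canonical_means: List[Vec3]      # one center per Gaussian
    frame_offsets: List[List[Vec3]]  # frame_offsets[t][i] = displacement of Gaussian i at frame t

    def means_at(self, t: int) -> List[Vec3]:
        """Return the centers at frame t by adding that frame's offsets."""
        offsets = self.frame_offsets[t]
        if len(offsets) != len(self.canonical_means):
            raise ValueError("each frame needs one offset per Gaussian")
        return [
            (mx + dx, my + dy, mz + dz)
            for (mx, my, mz), (dx, dy, dz) in zip(self.canonical_means, offsets)
        ]


# Two Gaussians, two frames: frame 0 is the canonical pose, frame 1 moves both.
field = DeformableGaussianField(
    canonical_means=[(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)],
    frame_offsets=[
        [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
        [(0.5, 0.0, 0.0), (0.0, 0.0, 1.0)],
    ],
)
frame1_means = field.means_at(1)
```

Keeping a single canonical set of primitives and varying only the per-frame deformation is one way a representation can stay explicit while still encoding motion; a renderer would then splat the deformed Gaussians for each requested timestamp and camera pose.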

Paper Structure

This paper contains 46 sections, 7 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Given a single image, a specified camera trajectory, and an optional text prompt, our diffusion-based framework directly generates a deformable 3D Gaussian field without test-time optimization. The resulting representation supports diverse applications, including video generation, depth rendering, and novel view synthesis, enabling real-time rendering of dynamic scenes and interactive virtual exploration.
  • Figure 2: Architecture of Diff4Splat. We present a high-fidelity dynamic 3DGS generation method from a single image through four key innovations: (1) video diffusion latents processed by our novel Transformer (Sec. LDRM), (2) a dynamic 3DGS deformation mechanism (Sec. Deformable Gaussian Fields), (3) unified supervision with photometric, geometric, and motion losses (Sec. Training Objective), and (4) a progressive training scheme for robust geometry and texture.
  • Figure 3: Qualitative comparison with state-of-the-art methods. Diff4Splat (last column) generates more visually appealing and temporally consistent 4D scenes with superior geometric fidelity compared to baselines. Kindly zoom in for details.
  • Figure 4: Ablation of the Deformable Gaussian Field. Removing this module results in ghosting artifacts (red bounding boxes), particularly in frames with large motion.
  • Figure 5: Ablation on the progressive training strategy.
  • ...and 4 more figures