Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion

Rishab Parthasarathy, Zachary Ankner, Aaron Gokaslan

TL;DR

Vid3D tackles dynamic 3D scene generation by decoupling 2D temporal seeding from per-frame 3D reconstruction, so that 3D temporal consistency is never modeled explicitly. It seeds a 2D video from a reference image, expands each timestep into multiple views, and fits a Gaussian splat to the views of each frame, yielding a 3D video without modeling temporal dynamics across the per-frame 3D representations. Quantitatively, Vid3D achieves CLIP-I scores (e.g., 0.8946) competitive with state-of-the-art baselines and is robust to the number of views used for multi-view synthesis, suggesting that 3D temporal knowledge may not be strictly necessary for high-quality dynamic 3D scenes. The approach offers a simpler, scalable alternative that leverages 2D video priors and is open-source for broader adoption and refinement.

Abstract

A recent frontier in computer vision has been the task of 3D video generation, which consists of generating a time-varying 3D representation of a scene. To generate dynamic 3D scenes, current methods explicitly model 3D temporal dynamics by jointly optimizing for consistency across both time and views of the scene. In this paper, we instead investigate whether it is necessary to explicitly enforce multiview consistency over time, as current approaches do, or whether it is sufficient for a model to generate 3D representations of each timestep independently. We hence propose a model, Vid3D, that leverages 2D video diffusion to generate 3D videos by first generating a 2D "seed" of the video's temporal dynamics and then independently generating a 3D representation for each timestep in the seed video. We evaluate Vid3D against two state-of-the-art 3D video generation methods and find that Vid3D achieves comparable results despite not explicitly modeling 3D temporal dynamics. We further ablate how the quality of Vid3D depends on the number of views generated per frame. While we observe some degradation with fewer views, it remains minor. Our results thus suggest that 3D temporal knowledge may not be necessary to generate high-quality dynamic 3D scenes, potentially enabling simpler generative algorithms for this task.
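
The TL;DR reports CLIP-I scores for this evaluation. As a rough illustration of what such a score measures, the sketch below assumes the common definition of CLIP-I: the mean cosine similarity between the CLIP image embedding of the reference image and the embeddings of rendered views. The function names and the toy two-dimensional vectors are hypothetical; real embeddings would come from a CLIP image encoder, and the paper's exact evaluation protocol is not given in this section.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_i(reference_embedding: Sequence[float],
           rendered_embeddings: Sequence[Sequence[float]]) -> float:
    """Mean similarity between the reference and each rendered view.

    Assumption: embeddings are precomputed by a CLIP image encoder.
    """
    sims = [cosine_similarity(reference_embedding, e)
            for e in rendered_embeddings]
    return sum(sims) / len(sims)


# Toy vectors standing in for CLIP embeddings (illustrative only).
ref = [1.0, 0.0]
renders = [[1.0, 0.0], [0.0, 1.0]]  # one identical, one orthogonal
score = clip_i(ref, renders)
```

Under this definition, a score near 1 means the rendered views stay close to the reference image in CLIP embedding space.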

Paper Structure

This paper contains 16 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: An overview of the Vid3D method. In stage 1, we generate a 2D video from a reference image to seed the dynamics of the scene. In stage 2, we generate multiple views for each timestep in the 2D video. In stage 3, we train a Gaussian splat on the collection of views from each timestep. Ultimately, each trained Gaussian splat represents a timestep in the 3D video.
  • Figure 2: An example of multiple 2D renderings of 3D videos generated by our method. The 3D videos are rendered from two different camera views (y-axis) through time (x-axis). We observe consistency between different camera views for the same timestep as well as plausible dynamics across time for the same view.
  • Figure 3: Rendering of a single frame from 3D videos generated using the same reference image but trained with varying numbers of synthesized views. There are no perceptible differences between 18 and 9 views, but there is significant degradation and noise with 3 views.
  • Figure 4: Rendering of a single frame from 3D videos generated from the same reference image but with different amounts of synthesized motion. As desired, the video with the higher motion score shows more variability while retaining rendering quality similar to that of the lower-motion video, demonstrating robustness to motion.
  • Figure 5: A qualitative comparison of Animate124, DreamGaussian4D, and Vid3D for a seed image of an astronaut riding a horse. Vid3D creates accurate representations from multiple angles; it neither recolors the horse, as Animate124 does, nor produces worse renders from non-reference views, as DreamGaussian4D does.
  • ...and 5 more figures
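
The three stages described in the Figure 1 caption can be sketched as a simple loop. This is a minimal illustration of the control flow only, not the authors' implementation: the class `Vid3DSketch`, its field names, and the toy stand-in functions are all hypothetical, and the default of 18 views is an assumption taken from the largest setting mentioned in the Figure 3 caption. In practice, each stage would be backed by a video diffusion model, a multi-view synthesis model, and a Gaussian splat trainer respectively.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Vid3DSketch:
    """Hypothetical wrapper around the three stages in Figure 1."""
    video_model: Callable[[Any], Sequence[Any]]           # stage 1: image -> 2D frames
    multiview_model: Callable[[Any, int], Sequence[Any]]  # stage 2: frame -> views
    fit_splat: Callable[[Sequence[Any]], Any]             # stage 3: views -> 3D splat
    num_views: int = 18  # assumed default; Figure 3 compares 18, 9, and 3

    def generate(self, reference_image: Any) -> List[Any]:
        # Stage 1: seed the temporal dynamics with a 2D video.
        frames = self.video_model(reference_image)
        splats = []
        for frame in frames:
            # Stage 2: synthesize multiple views of this timestep.
            views = self.multiview_model(frame, self.num_views)
            # Stage 3: fit one 3D representation per timestep,
            # independently of every other timestep.
            splats.append(self.fit_splat(views))
        return splats


# Toy stand-ins so the sketch runs end to end (strings instead of images).
def toy_video(img):
    return [f"{img}@t{t}" for t in range(3)]


def toy_views(frame, n):
    return [f"{frame}/v{v}" for v in range(n)]


def toy_fit(views):
    return {"n_views": len(views), "first": views[0]}


pipeline = Vid3DSketch(toy_video, toy_views, toy_fit, num_views=9)
splats = pipeline.generate("ref")
```

Note that no information flows between loop iterations: all temporal structure comes from the stage 1 video, which is the design choice the paper sets out to test.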