Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes

Thomas Wimmer, Michael Oechsle, Michael Niemeyer, Federico Tombari

TL;DR

This work introduces Gaussians2Life, a training-free framework for text-driven animation of static 3D Gaussian Splatting scenes. It combines diffusion-guided 2D video generation with depth-informed 2D-to-3D lifting to deform Gaussian primitives coherently across views, avoiding heavy per-frame optimization. Key contributions include a diffusion guidance strategy that achieves approximate multi-view consistency, a robust pipeline for lifting 2D motion to 3D via depth, tracking, and deformation transfer, and extensive qualitative validation on real-world datasets against adapted baselines. The approach enables realistic, view-consistent dynamic scenes while preserving original appearance, offering a fast and generalizable path to lively 3D experiences, albeit with limitations related to hole filling, camera motion handling, and diffusion-model biases.

Abstract

State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack "liveliness," a key component for creating engaging 3D experiences. Recent video diffusion models generate realistic videos with complex motion and enable animations of 2D images; however, they cannot naively be used to animate 3D scenes, as they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine these with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and further enables the animation of a large variety of object classes, while related work is mostly focused on prior-based character animation or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.

Paper Structure

This paper contains 22 sections, 7 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Our proposed method Gaussians2Life preserves high visual quality of scenes while animating them according to a text prompt. It significantly outperforms a baseline method crafted from DreamGaussian4D (Ren et al., 2023) and creates more realistic movements.
  • Figure 2: Improvement of multi-view consistency of generated videos through latent interpolation. In addition to the rendering of the dynamic scene $f$ using the rendering function $g$ from the current viewpoint $g(f)_{s}$, we compute the latent embedding of the warped video output $v_{s-1}$ of the previous optimization step $s-1$ (from a different viewpoint). We linearly interpolate the latents before passing them through the video diffusion model (VDM), which is additionally conditioned on the static scene view from the current viewpoint. The resulting output is finally decoded to a new video output $v_{s}$.
  • Figure 3: Pipeline for lifting 2D dynamics into 3D. Pre-trained models are shown in blue. We detect 2D point tracks and use aligned estimated depth values to lift them into 3D.
  • Figure 4: Comparison of linear and rigid motion estimation. The rigid motion estimation finds a fitting rotation for the source displacements and estimates the displacement for the target point accordingly.
  • Figure 5: Qualitative comparison against ablations on the LEGO bulldozer scene for the prompt "toy bulldozer lifting its shovel."
  • ...and 8 more figures
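
The captions of Figures 3 and 4 describe two steps that a short sketch may make more concrete: lifting 2D point tracks into 3D with depth, and estimating the displacement of a target point from nearby source displacements, either linearly or rigidly. The code below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a pinhole camera with intrinsics `K` and z-depth per tracked pixel, and it uses the standard Kabsch (SVD-based) fit as a stand-in for the "fitting rotation" mentioned in Figure 4. All function names are hypothetical.

```python
import numpy as np


def unproject(uv, depth, K):
    """Lift pixels (N, 2) with z-depth (N,) to camera-space points (N, 3).

    Assumes a pinhole camera with 3x3 intrinsics K (an assumption of this sketch).
    """
    uv1 = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)  # homogeneous pixels
    rays = uv1 @ np.linalg.inv(K).T  # rays with z = 1
    return rays * depth[:, None]


def linear_displacement(src_disp, weights=None):
    """Linear estimate: (weighted) average of the source displacements."""
    return np.average(src_disp, axis=0, weights=weights)


def rigid_displacement(src_pts, src_disp, target):
    """Rigid estimate: fit a rotation and translation to the moving source
    points (Kabsch algorithm), then apply that motion to the target point."""
    moved = src_pts + src_disp
    c0 = src_pts.mean(axis=0)
    c1 = moved.mean(axis=0)
    # Cross-covariance between the centered source and moved point sets.
    H = (src_pts - c0).T @ (moved - c1)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R @ target + t - target
```

For source points that rotate about an axis, the linear estimate averages displacements that point in different directions and can underestimate the motion of a target point farther from the axis, whereas the rigid estimate carries the target along the fitted rotation. This matches the distinction drawn in the Figure 4 caption, though the paper's exact weighting and neighborhood selection are not reproduced here.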