Gaussians-to-Life: Text-Driven Animation of 3D Gaussian Splatting Scenes
Thomas Wimmer, Michael Oechsle, Michael Niemeyer, Federico Tombari
TL;DR
This work introduces Gaussians2Life, a training-free framework for text-driven animation of static 3D Gaussian Splatting scenes. It combines diffusion-guided 2D video generation with depth-informed 2D-to-3D lifting to deform Gaussian primitives coherently across views, avoiding heavy per-frame optimization. Key contributions include a diffusion guidance strategy that achieves approximate multi-view consistency, a robust pipeline for lifting 2D motion to 3D via depth, tracking, and deformation transfer, and extensive qualitative validation on real-world datasets against adapted baselines. The approach enables realistic, view-consistent dynamic scenes while preserving original appearance, offering a fast and generalizable path to lively 3D experiences, albeit with limitations related to hole filling, camera motion handling, and diffusion-model biases.
Abstract
State-of-the-art novel view synthesis methods achieve impressive results for multi-view captures of static 3D scenes. However, the reconstructed scenes still lack "liveliness," a key component for creating engaging 3D experiences. Recent video diffusion models generate realistic videos with complex motion and can animate 2D images, but they cannot be applied naively to 3D scenes because they lack multi-view consistency. To breathe life into the static world, we propose Gaussians2Life, a method for animating parts of high-quality 3D scenes in a Gaussian Splatting representation. Our key idea is to leverage powerful video diffusion models as the generative component of our model and to combine them with a robust technique to lift 2D videos into meaningful 3D motion. We find that, in contrast to prior work, this enables realistic animations of complex, pre-existing 3D scenes and the animation of a large variety of object classes, whereas related work mostly focuses on prior-based character animation or single 3D objects. Our model enables the creation of consistent, immersive 3D experiences for arbitrary scenes.
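
To make the idea of lifting 2D motion into 3D concrete, the sketch below unprojects tracked 2D points with per-point depth and measures their 3D displacement relative to the first frame. It is an illustration only, not the authors' implementation: it assumes a simple pinhole camera with known intrinsics K, and the function names (unproject, lift_tracks) and array layouts are hypothetical.

```python
import numpy as np

def unproject(pixels, depth, K):
    """Lift (N, 2) pixel coordinates with (N,) depths to (N, 3) camera-space points.

    Assumes a pinhole camera with 3x3 intrinsics K (an assumption of this sketch).
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (pixels[:, 0] - cx) / fx * depth
    y = (pixels[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=1)

def lift_tracks(tracks_2d, depths, K):
    """Turn 2D point tracks into 3D displacements relative to frame 0.

    tracks_2d: (T, N, 2) pixel positions of N tracked points over T frames.
    depths:    (T, N) depth of each tracked point in each frame.
    Returns:   (T, N, 3) displacement of each point from its frame-0 position.
    """
    points = np.stack(
        [unproject(tracks_2d[t], depths[t], K) for t in range(len(tracks_2d))]
    )
    return points - points[0]

# Hypothetical example: one point moving 50 pixels to the right at constant depth.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
tracks_2d = np.array([[[50.0, 50.0]], [[100.0, 50.0]]])  # (T=2, N=1, 2)
depths = np.array([[2.0], [2.0]])                        # (T=2, N=1)
displacements = lift_tracks(tracks_2d, depths, K)
```

In a full pipeline, displacements of this kind would still need to be transferred to the Gaussian primitives and reconciled across views; that step is outside the scope of this sketch.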
