BAGS: Building Animatable Gaussian Splatting from a Monocular Video with Diffusion Priors

Tingyang Zhang, Qingzhe Gao, Weiyu Li, Libin Liu, Baoquan Chen

TL;DR

The paper tackles animatable 3D reconstruction from monocular video, a setting plagued by limited view coverage and high computational costs. It introduces BAGS, which builds an animatable 3D model using Gaussian Splatting in a canonical space, animated by neural bones and guided by diffusion priors that compensate for unseen viewpoints; a rigid regularization stabilizes training. The approach achieves state-of-the-art geometry, appearance, and animation quality on in-the-wild videos while dramatically reducing training time and enabling real-time rendering on a single GPU. Key contributions include integrating diffusion priors with Gaussian Splatting, introducing neural bones for articulation, applying a rigid loss to curb artifacts, and providing a new dataset and extensive ablations to validate the method.

Abstract

Animatable 3D reconstruction has significant applications across various fields, yet it still relies primarily on handcrafted creation by artists. Recently, some studies have successfully constructed animatable 3D models from monocular videos. However, these approaches require sufficient view coverage of the object within the input video and typically incur significant time and computational costs for training and rendering. These limitations restrict their practical applications. In this work, we propose a method to build animatable 3D Gaussian Splatting from monocular video with diffusion priors. The 3D Gaussian representations significantly accelerate the training and rendering process, and the diffusion priors allow the method to learn 3D models with limited viewpoints. We also present a rigid regularization to enhance the utilization of the priors. We perform an extensive evaluation across various real-world videos, demonstrating superior performance compared to current state-of-the-art methods.
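
The abstract mentions a rigid regularization but does not give its formulation here. As a rough illustration of what such a term can look like, the sketch below penalizes changes in distance between pairs of neighboring points when they move from the canonical pose to a deformed pose. The function name `local_rigidity_penalty`, the explicit `neighbors` pair list, and the squared-difference form are all assumptions made for illustration, not the paper's loss.

```python
import math


def local_rigidity_penalty(canonical, deformed, neighbors):
    """Mean squared change in distance over the listed point pairs.

    canonical, deformed: sequences of 3D points (same length, same order).
    neighbors: list of (i, j) index pairs assumed to move near-rigidly.
    NOTE: a generic distance-preservation penalty, assumed for illustration.
    """
    total = 0.0
    for i, j in neighbors:
        d_canonical = math.dist(canonical[i], canonical[j])
        d_deformed = math.dist(deformed[i], deformed[j])
        total += (d_deformed - d_canonical) ** 2
    return total / len(neighbors)


# Hypothetical usage: a pure translation keeps every pairwise distance,
# while stretching one pair changes it and is penalized.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
translated = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pairs = [(0, 1)]
rigid_cost = local_rigidity_penalty(canonical, translated, pairs)
stretch_cost = local_rigidity_penalty(canonical, stretched, pairs)
```

A term of this kind is zero under any rigid motion of the neighborhood, which is why it can discourage local stretching without restricting articulated movement.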

Paper Structure (20 sections, 10 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Given a single casual video, our method constructs an animatable 3D Gaussian Splatting model with diffusion priors. This not only compensates for unseen view information but also enables fast training and real-time rendering.
  • Figure 2: Overview of our method. We construct a canonical space using Gaussian Splatting. In the absence of a templated parametric model, we develop a neural bones representation to animate the canonical space to match the input video. Additionally, we utilize a diffusion model to address unseen view information and apply a rigid constraint to facilitate training. After training, the model can be manually manipulated to achieve novel pose rendering.
  • Figure 3: Qualitative results. Compared with BANMo (Yang et al., 2022), our method demonstrates superior fidelity to the input image, exhibiting enhanced geometric detail and texture richness. Moreover, our method achieves better performance in novel view synthesis, where BANMo fails to produce reasonable 3D shapes and instead yields merely a 2D plane that overfits the input video.
  • Figure 4: Animation results. Our method supports manual manipulation for generating animation.
  • Figure 5: Qualitative results. Compared with BANMo (Yang et al., 2022), our method demonstrates better performance, exhibiting enhanced geometric detail and texture richness. BANMo fails to produce reasonable 3D shapes and instead yields merely a 2D plane that overfits the input video.
  • ...and 2 more figures
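
The Figure 2 caption describes neural bones that animate the canonical space, without giving the deformation formula. One common way to drive points with bones is blend skinning, sketched below for the centers of canonical Gaussians. Everything here is an assumption for illustration: the helper names, the distance-based softmax weights, and the `sharpness` parameter are hypothetical, and the paper's actual bone parameterization may differ.

```python
import math


def skinning_weights(point, bone_centers, sharpness=10.0):
    """Softmax over negative squared distances from a point to each bone center."""
    logits = [
        -sharpness * sum((p - c) ** 2 for p, c in zip(point, center))
        for center in bone_centers
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_rigid(rotation, translation, point):
    """Apply a 3x3 rotation (nested lists, row-major) and a translation to a 3D point."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def blend_skinning(points, bone_centers, rotations, translations, sharpness=10.0):
    """Move each canonical point by a weighted mix of per-bone rigid transforms."""
    deformed = []
    for point in points:
        weights = skinning_weights(point, bone_centers, sharpness)
        moved = [apply_rigid(r, t, point) for r, t in zip(rotations, translations)]
        deformed.append(
            [sum(w * m[i] for w, m in zip(weights, moved)) for i in range(3)]
        )
    return deformed


# Hypothetical usage: two bones, the second one lifted along the y axis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centers = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
posed = blend_skinning(
    points,
    centers,
    rotations=[IDENTITY, IDENTITY],
    translations=[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
```

In this toy setup the point nearest the second bone follows its translation almost entirely, while the point at the first bone barely moves. A full Gaussian Splatting pipeline would also need to transform each Gaussian's orientation and covariance, which this sketch omits.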