
3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, Siyu Tang

TL;DR

The paper tackles animatable clothed avatar creation from monocular video, a setting where NeRF-based methods train slowly and cannot render in real time. It augments 3D Gaussian Splatting (3DGS) with a pose-conditioned non-rigid deformation network and a rigid skinning module, plus a compact color MLP for view-dependent shading, all rendered via differentiable Gaussian rasterization. As-isometric-as-possible regularization on Gaussian centers and covariances improves generalization to unseen poses. The resulting avatar trains in under 30 minutes on a single GPU and renders at 50+ FPS, with rendering quality comparable to or better than state-of-the-art baselines that are far slower to train and render. Because the representation is explicit and fast to deform, the method makes interactive monocular-avatar applications in VR/AR and related domains practical.

Abstract

We introduce an approach that creates animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural radiance fields (NeRFs) achieve high-quality novel-view/novel-pose image synthesis but often require days of training, and are extremely slow at inference time. Recently, the community has explored fast grid structures for efficient training of clothed avatars. Albeit being extremely fast at training, these methods can barely achieve an interactive rendering frame rate with around 15 FPS. In this paper, we use 3D Gaussian Splatting and learn a non-rigid deformation network to reconstruct animatable clothed human avatars that can be trained within 30 minutes and rendered at real-time frame rates (50+ FPS). Given the explicit nature of our representation, we further introduce as-isometric-as-possible regularizations on both the Gaussian mean vectors and the covariance matrices, enhancing the generalization of our model on highly articulated unseen poses. Experimental results show that our method achieves comparable and even better performance compared to state-of-the-art approaches on animatable avatar creation from a monocular input, while being 400x and 250x faster in training and inference, respectively.
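
The as-isometric-as-possible regularization on Gaussian means mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the formula, so this assumes one plausible form: for each Gaussian center, find its k nearest neighbours in canonical space and penalize any change in those pairwise distances after deformation. The function names, the brute-force neighbour search, and the choice of `k` are illustrative assumptions, and the covariance counterpart is omitted.

```python
import math

def knn_indices(points, k):
    """For each point, return the indices of its k nearest neighbours (brute force)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(others[:k])
    return result

def aiap_position_loss(canonical, observed, k=2):
    """Sum of |d_canonical - d_observed| over canonical-space k-NN pairs.

    Assumed form of the regularizer: neighbours are chosen once in canonical
    space, and the loss is zero whenever the deformation preserves their
    pairwise distances (i.e. is locally isometric).
    """
    loss = 0.0
    for i, nbrs in enumerate(knn_indices(canonical, k)):
        for j in nbrs:
            d_c = math.dist(canonical[i], canonical[j])
            d_o = math.dist(observed[i], observed[j])
            loss += abs(d_c - d_o)
    return loss

# Toy Gaussian centers in canonical space.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Rigid motion (90 degree rotation about z plus a translation): distances are preserved.
rigid = [(-y + 2.0, x - 1.0, z + 0.5) for x, y, z in canonical]

# Non-isometric deformation (stretch along x): distances change.
stretched = [(2.0 * x, y, z) for x, y, z in canonical]

rigid_loss = aiap_position_loss(canonical, rigid)
stretched_loss = aiap_position_loss(canonical, stretched)
```

Under this assumed form, a rigid motion of the whole point set incurs no penalty, while stretching does, which is the behaviour expected of a regularizer meant to suppress artifacts on highly articulated poses. A practical implementation would batch this on the GPU instead of looping in Python.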

Paper Structure (39 sections, 13 equations, 12 figures, 8 tables)


Figures (12)

  • Figure 1: 3DGS-Avatar. We develop an efficient method for creating animatable avatars from monocular videos, leveraging 3D Gaussian Splatting [kerbl3Dgaussians]. Given a short sequence of a dynamic human with a tracked skeleton and foreground masks, our method creates an avatar within 30 minutes on a single GPU, supports animation and novel view synthesis at over 50 FPS, and achieves rendering quality comparable to or better than the state of the art [weng2022humannerf, ARAH:ECCV:2022], which requires over 8 GPU days to train, takes several seconds to render a single image, and relies on pre-training on clothed human scans [ARAH:ECCV:2022].
  • Figure 2: Our framework for creating animatable avatars from monocular videos. We first initialize a set of 3D Gaussians in the canonical space via sampling points from a SMPL mesh. Each canonical Gaussian $\mathcal{G}_c$ goes through a non-rigid deformation module $\mathcal{F}_{\theta_{nr}}$ conditioned on an encoded pose vector $\mathcal{Z}_p$ (\ref{sec:non_rigid}) to account for pose-dependent non-rigid cloth deformation. This module outputs a non-rigidly deformed 3D Gaussian $\mathcal{G}_d$ and a pose-dependent latent feature $\mathbf{z}$. The non-rigidly deformed 3D Gaussian $\mathcal{G}_d$ is transformed to the observation space $\mathcal{G}_o$ (\ref{sec:rigid}) via LBS with learned neural skinning $\mathcal{F}_{\theta_r}$. The Gaussian feature $\mathbf{f}$, the pose-dependent feature $\mathbf{z}$, a per-frame latent code $\mathcal{Z}_c$, and the ray direction $\mathbf{d}$ are propagated through a small MLP $\mathcal{F}_{\theta_c}$ to decode the view-dependent color $c$ for each 3D Gaussian. Finally, the observation space 3D Gaussians $\{\mathcal{G}_o\}$ and their respective color values are accumulated via differentiable Gaussian rasterization (\ref{eq:render}) to render the image.
  • Figure 3: Qualitative Comparison on ZJU-MoCap [peng2020neural]. We show the results for both novel view synthesis and novel pose animation of all sequences on ZJU-MoCap. Our method produces high-quality results that preserve cloth details even on out-of-distribution poses.
  • Figure 4: Ablation Study on as-isometric-as-possible regularization, which removes the artifacts on highly articulated poses.
  • Figure 5: Qualitative Ablation of Color MLP.
  • ...and 7 more figures
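
The rigid step in the Figure 2 caption, where a non-rigidly deformed Gaussian is moved to observation space via LBS, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the per-Gaussian skinning weights are already given (in the paper they come from the learned skinning module $\mathcal{F}_{\theta_r}$), that each bone is a rotation plus a translation, and that the blended linear part transforms the covariance as $R \Sigma R^\top$. All function and variable names are hypothetical, and plain Python lists stand in for batched tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    """Transpose a matrix given as nested lists."""
    return [list(row) for row in zip(*m)]

def blend_transforms(weights, rotations, translations):
    """Linear blend skinning: weighted sum of per-bone rotations and translations."""
    R = [
        [sum(w * Rb[i][j] for w, Rb in zip(weights, rotations)) for j in range(3)]
        for i in range(3)
    ]
    t = [sum(w * tb[i] for w, tb in zip(weights, translations)) for i in range(3)]
    return R, t

def skin_gaussian(mean, cov, weights, rotations, translations):
    """Map one Gaussian to observation space.

    mean -> R @ mean + t, cov -> R @ cov @ R^T, where (R, t) is the blended
    bone transform for this Gaussian.
    """
    R, t = blend_transforms(weights, rotations, translations)
    new_mean = [sum(R[i][j] * mean[j] for j in range(3)) + t[i] for i in range(3)]
    new_cov = matmul(matmul(R, cov), transpose(R))
    return new_mean, new_cov

# Two toy bones: identity, and a 90 degree rotation about z with a lift along z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
rotations = [identity, rot_z_90]
translations = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# A Gaussian elongated along x, bound entirely to the second bone.
mean = [1.0, 0.0, 0.0]
cov = [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
new_mean, new_cov = skin_gaussian(mean, cov, [0.0, 1.0], rotations, translations)
```

With the Gaussian fully bound to the rotating bone, its center moves from the x axis to the y axis and is lifted along z, and its long axis rotates from x to y. With fractional weights, the blended matrix is in general no longer a pure rotation, which is a known property of linear blend skinning.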