GASP: Gaussian Avatars with Synthetic Priors

Jack Saunders, Charlie Hewitt, Yanan Jian, Marek Kowalski, Tadas Baltrusaitis, Yiye Chen, Darren Cosker, Virginia Estellers, Nicholas Gyde, Vinay P. Namboodiri, Benjamin E Lundell

TL;DR

GASP generates photorealistic, animatable avatars from minimal data by training a prior over Gaussian Avatar parameters on a large synthetic dataset. The method learns per-Gaussian features for a mesh-attached Gaussian representation and uses a three-stage fitting pipeline (inversion within the prior, fine-tuning of the MLP $\mathcal{D}$, and refinement of the individual Gaussians) to bridge the synthetic-to-real domain gap and enable 360° rendering. The resulting avatars can be animated and rendered at around 70fps on consumer hardware, stored as compact ~15MB meshes, and support back-of-head reconstruction despite training from frontal views. Across monocular, single-image, and multi-camera evaluations, GASP achieves state-of-the-art or competitive results with fewer artifacts in unseen views, demonstrating practical applicability for VR, video conferencing, and entertainment.

Abstract

Gaussian Splatting has changed the game for real-time photo-realistic rendering. One of the most popular applications of Gaussian Splatting is to create animatable avatars, known as Gaussian Avatars. Recent works have pushed the boundaries of quality and rendering efficiency but suffer from two main limitations. Either they require expensive multi-camera rigs to produce avatars with free-view rendering, or they can be trained with a single camera but only rendered at high quality from this fixed viewpoint. An ideal model would be trained using a short monocular video or image from available hardware, such as a webcam, and rendered from any view. To this end, we propose GASP: Gaussian Avatars with Synthetic Priors. To overcome the limitations of existing datasets, we exploit the pixel-perfect nature of synthetic data to train a Gaussian Avatar prior. By fitting this prior model to a single photo or video and fine-tuning it, we get a high-quality Gaussian Avatar, which supports 360$^\circ$ rendering. Our prior is only required for fitting, not inference, enabling real-time application. Through our method, we obtain high-quality, animatable Avatars from limited data which can be animated and rendered at 70fps on commercial hardware. See our project page (https://microsoft.github.io/GASP/) for results.

Paper Structure

This paper contains 34 sections, 6 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: The overview of our model. In the first stage, we train an autodecoder prior model on synthetic data to predict the parameters of a mesh-attached Gaussian Avatar. We can then adapt this model to user enrollment data, either a single image or a short monocular video. We leverage the prior to improve the quality in unseen regions and achieve free-viewpoint rendering.
  • Figure 2: The architecture of our prior model. A latent vector for identity is used to transform learnable per-Gaussian features into Gaussian Attributes, which offset a canonical template. Our training process has four stages: the prior training, P, and three user-specific fitting steps. We freeze some layers and train others at each stage, as indicated.
  • Figure 3: Visualization of the first three components of a PCA decomposition of the Gaussian features $\mathbf{f}$, displayed using the geometry of a random subject. Note the semantic relationships.
  • Figure 4: Examples showing how the three stages in our fitting process resolve the domain gap of the synthetic prior. Stage 1 (Top) optimizes within the prior, Stage 2 (Middle) finetunes the MLP, $\mathcal{D}$, and Stage 3 (Bottom) refines the individual Gaussians. Note the beard and eyes.
  • Figure 5: Examples from our synthetic dataset. We generate a large and diverse set of synthetic subjects rendered from many views to train our prior model.
  • ...and 11 more figures
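
The captions of Figures 2 and 4 describe a decoder that turns per-Gaussian features and an identity latent into attribute offsets on a canonical template, followed by three fitting stages. The following is a minimal sketch of that idea, not the authors' implementation. Everything named here is hypothetical: the sizes, the single linear layer standing in for the MLP $\mathcal{D}$, the choice to concatenate the feature and the latent, and the per-stage parameter groups, which are inferred from the caption text only.

```python
import random
from dataclasses import dataclass

# Hypothetical sizes, chosen only for illustration.
NUM_GAUSSIANS = 4
FEATURE_DIM = 3
LATENT_DIM = 2
ATTR_DIM = 3  # e.g. a 3D position offset; a real attribute set would be larger


def linear(weights, bias, x):
    """Apply y = W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


@dataclass
class PriorSketch:
    features: list  # per-Gaussian learnable features f, [N][FEATURE_DIM]
    weights: list   # stand-in for the MLP D, [ATTR_DIM][FEATURE_DIM + LATENT_DIM]
    bias: list      # decoder bias, [ATTR_DIM]
    template: list  # canonical template attributes, [N][ATTR_DIM]

    def decode(self, latent):
        """Map each (f_i, z) pair to an offset and add it to the template."""
        out = []
        for f_i, t_i in zip(self.features, self.template):
            # Assumption: feature and latent are simply concatenated.
            offset = linear(self.weights, self.bias, f_i + latent)
            out.append([t + o for t, o in zip(t_i, offset)])
        return out


# Parameter groups optimized in each fitting stage. This is an assumption
# inferred from the stage descriptions, not a confirmed freeze schedule.
FITTING_STAGES = {
    "stage1_inversion": {"latent"},
    "stage2_decoder_finetune": {"decoder"},
    "stage3_gaussian_refinement": {"gaussians"},
}


def make_sketch(seed=0):
    """Build a small randomly initialized model for experimentation."""
    rng = random.Random(seed)
    features = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]
                for _ in range(NUM_GAUSSIANS)]
    weights = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM + LATENT_DIM)]
               for _ in range(ATTR_DIM)]
    bias = [0.0] * ATTR_DIM
    template = [[float(i)] * ATTR_DIM for i in range(NUM_GAUSSIANS)]
    return PriorSketch(features, weights, bias, template)
```

Calling `make_sketch().decode([0.0] * LATENT_DIM)` returns one attribute vector per Gaussian. In this sketch, the stages narrow what is optimized: first only the latent, then the decoder, and finally the decoded Gaussians themselves.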