ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering

Haokai Pang, Heming Zhu, Adam Kortylewski, Christian Theobalt, Marc Habermann

TL;DR

ASH tackles real-time photorealistic rendering of animatable clothed humans by representing the actor with a fixed set of 3D Gaussian splats attached to a deformable template mesh. Gaussian parameters are learned in 2D texture space via motion-aware texture decoders, enabling efficient image-space splatting under user-controlled skeletal motion. A two-stage training strategy, consisting of a warmup with pseudo-ground-truth parameters followed by a final pixel- and SSIM-based optimization, yields high-fidelity, motion-dependent appearance while maintaining real-time performance. Empirical results on multi-view datasets show that ASH outperforms existing real-time methods by a large margin and closely matches or surpasses several offline approaches, highlighting its potential for interactive avatars in AR/VR and games. Overall, ASH reduces manual labor and provides scalable, controllable, photorealistic rendering of dynamic humans learned exclusively from multi-view videos.

Abstract

Real-time rendering of photorealistic and controllable human avatars stands as a cornerstone in Computer Vision and Graphics. While recent advances in neural implicit rendering have unlocked unprecedented photorealism for digital avatars, real-time performance has mostly been demonstrated for static scenes only. To address this, we propose ASH, an animatable Gaussian splatting approach for photorealistic rendering of dynamic humans in real-time. We parameterize the clothed human as animatable 3D Gaussians, which can be efficiently splatted into image space to generate the final rendering. However, naively learning the Gaussian parameters in 3D space poses a severe challenge in terms of compute. Instead, we attach the Gaussians onto a deformable character model, and learn their parameters in 2D texture space, which allows leveraging efficient 2D convolutional architectures that easily scale with the required number of Gaussians. We benchmark ASH with competing methods on pose-controllable avatars, demonstrating that our method outperforms existing real-time methods by a large margin and shows comparable or even better results than offline methods.
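
As a rough illustration of what "learning parameters in 2D texture space" can look like, the NumPy sketch below treats each covered texel of a parameter texture as one Gaussian. It is not the authors' implementation: the function names, the 14-channel layout, and the activations (exponential for scale, sigmoid for opacity, normalized quaternion for rotation) are assumptions borrowed from common Gaussian splatting practice, and the random array only stands in for a decoder output.

```python
import numpy as np

def texels_to_gaussians(param_tex, valid_mask):
    """Gather one raw parameter vector per covered texel.

    param_tex:  (H, W, C) array standing in for a decoder output (hypothetical).
    valid_mask: (H, W) boolean array marking texels covered by the UV layout.
    Returns an (N, C) array, where N is the number of covered texels.
    """
    return param_tex[valid_mask]

def activate(raw):
    """Map raw channels to Gaussian parameters.

    Assumed channel layout (not from the paper):
    0:3 position offset, 3:6 log-scale, 6:10 quaternion, 10 opacity logit, 11:14 color.
    """
    quat = raw[:, 6:10]
    return {
        "offset": raw[:, 0:3],
        "scale": np.exp(raw[:, 3:6]),  # keeps scales strictly positive
        "rotation": quat / np.linalg.norm(quat, axis=1, keepdims=True),  # unit quaternion
        "opacity": 1.0 / (1.0 + np.exp(-raw[:, 10:11])),  # sigmoid into (0, 1)
        "color": raw[:, 11:14],
    }

# Toy example: a 4x4 texture in which 6 texels are covered by the mesh.
rng = np.random.default_rng(0)
param_tex = rng.normal(size=(4, 4, 14))
valid_mask = np.zeros((4, 4), dtype=bool)
valid_mask[1:3, 1:4] = True

gaussians = activate(texels_to_gaussians(param_tex, valid_mask))
print(gaussians["scale"].shape)
```

In this layout the number of Gaussians is simply the number of covered texels, so a larger texture gives more Gaussians without changing the 2D network that produces it.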

Paper Structure

This paper contains 22 sections, 10 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Qualitative Ablation. We compare ASH with variants that make alternative design choices. ASH achieves higher rendering quality than the variant that learns the Gaussian parameters directly in 3D canonical space (w/ MLP). Moreover, ASH remains robust when trained with fewer camera views (w/ 12.cam, w/ 30.cam, w/ 60.cam).
  • Figure 2: ASH generates high-fidelity renderings given a skeletal motion and a virtual camera view. A learned deformation network generates a motion-dependent, canonicalized template mesh. From this canonical template mesh, we render motion-aware textures, which two 2D convolutional networks, the Geometry Decoder and the Appearance Decoder, use to predict the Gaussian splat parameters as texels in 2D texture space. Through UV mapping and dual quaternion (DQ) skinning, we warp the Gaussian splats from canonical space to posed space. The posed Gaussian splats are then rendered via splatting.
  • Figure 2: ASH conditioned on SMPL. Despite large deviations between the underlying template and the real surface, ASH generates visually plausible results.
  • Figure 3: Qualitative Results. We present results generated with ASH for novel view and novel pose synthesis. Note that our method produces high-quality renderings with delicate, motion-aware details for novel views and skeletal motions.
  • Figure 3: Results with AMASS DanceDB motion. ASH produces photorealistic renderings given motion from an entirely different dataset.
  • ...and 3 more figures