GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, Liqiang Nie

TL;DR

GaussianAvatar presents an explicit, animatable 3D Gaussian representation for realistic human avatars from a single video. It combines a pose-conditioned dynamic appearance network with an optimizable global feature tensor to model pose-dependent details and wrinkles, while leveraging forward skinning for reposing. A two-stage training strategy enables joint motion and appearance optimization, improving motion estimates and reducing monocular artifacts. On both a public dataset and the authors' collected dataset, the approach delivers superior appearance quality and efficient rendering, with demonstrated potential for hand animation and out-of-distribution poses.

Abstract

We present GaussianAvatar, an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. We start by introducing animatable 3D Gaussians to explicitly represent humans in various poses and clothing styles. Such an explicit and animatable representation can fuse 3D appearances more efficiently and consistently from 2D observations. Our representation is further augmented with dynamic properties to support pose-dependent appearance modeling, where a dynamic appearance network along with an optimizable feature tensor is designed to learn the motion-to-appearance mapping. Moreover, by leveraging the differentiable motion condition, our method enables a joint optimization of motions and appearances during avatar modeling, which helps to tackle the long-standing issue of inaccurate motion estimation in monocular settings. The efficacy of GaussianAvatar is validated on both a public dataset and our collected dataset, demonstrating its superior performance in terms of appearance quality and rendering efficiency.

Paper Structure

This paper contains 23 sections, 4 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: We propose GaussianAvatar, which learns animatable 3D Gaussians to represent detailed human avatars from a single video. Our method maintains a 3D consistent appearance even when animated by out-of-distribution motions.
  • Figure 2: Overview of GaussianAvatar. Given a fitted SMPL or SMPL-X model on the current frame, we sample the points on its surface and record their positions on a UV positional map $I$, which is then passed to a pose encoder to obtain the pose feature. An optimizable feature tensor is pixel-aligned with the pose feature and learned to capture the coarse appearance of humans. Then the two aligned feature tensors are input into the Gaussian parameter decoder, which predicts each point's offset $\Delta \hat{ \mathbf{x}}$, color $\hat{ \mathbf{c}}$, and scale $\hat{s}$. These predictions, along with the fixed rotations $\mathbf{q}$ and opacity $\alpha$, collectively constitute the animatable 3D Gaussians in canonical space.
  • Figure 3: Effect of isotropy of 3D Gaussians. (a) Input image, (b)(d) front and back views trained with isotropic 3D Gaussians, (c)(e) front and back views trained with anisotropic 3D Gaussians.
  • Figure 4: Motion optimization results. (a)(d) Original image, (b)(e) our optimized SMPL, (c)(f) ROMP [sun2021monocular] estimates.
  • Figure 5: Qualitative ablation studies. (a) Ground truth, (b) baseline + Opt. + Dyn., (c) baseline + Opt., (d) baseline.
  • ...and 6 more figures
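
The data flow described in the Figure 2 caption, together with the forward skinning mentioned in the TL;DR, can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Gaussian`, `assemble_gaussians`, and `forward_skin` are hypothetical, the decoder outputs are hand-written stand-ins for network predictions, the fixed rotation and opacity are assumed to be the identity quaternion and 1.0, and the skinning is assumed to be standard linear blend skinning with rigid per-bone transforms.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]
# Hypothetical layout for a rigid bone transform: (3x3 rotation rows, translation).
Transform = Tuple[Tuple[Vec3, Vec3, Vec3], Vec3]


@dataclass
class Gaussian:
    position: Vec3  # canonical-space centre: sampled surface point + predicted offset
    color: Vec3     # predicted colour
    scale: float    # predicted (isotropic) scale
    rotation: Quat  # fixed, not predicted
    opacity: float  # fixed, not predicted


def assemble_gaussians(
    points: Sequence[Vec3],
    offsets: Sequence[Vec3],
    colors: Sequence[Vec3],
    scales: Sequence[float],
    fixed_rotation: Quat = (1.0, 0.0, 0.0, 0.0),  # assumption: identity quaternion
    fixed_opacity: float = 1.0,                   # assumption: fully opaque
) -> List[Gaussian]:
    """Combine per-point decoder outputs with the fixed attributes."""
    return [
        Gaussian(
            position=tuple(p + d for p, d in zip(x, dx)),
            color=c,
            scale=s,
            rotation=fixed_rotation,
            opacity=fixed_opacity,
        )
        for x, dx, c, s in zip(points, offsets, colors, scales)
    ]


def apply_transform(transform: Transform, x: Vec3) -> Vec3:
    """Apply one rigid transform: R @ x + t."""
    rotation, translation = transform
    return tuple(
        sum(r * v for r, v in zip(row, x)) + t
        for row, t in zip(rotation, translation)
    )


def forward_skin(
    position: Vec3, weights: Sequence[float], bones: Sequence[Transform]
) -> Vec3:
    """Linear blend skinning: weighted sum of the point as moved by each bone."""
    posed = [0.0, 0.0, 0.0]
    for w, bone in zip(weights, bones):
        moved = apply_transform(bone, position)
        for i in range(3):
            posed[i] += w * moved[i]
    return tuple(posed)


# Toy example: one surface point, two bones with equal skinning weights.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
bones = [(identity, (0.0, 0.0, 0.0)), (identity, (0.0, 2.0, 0.0))]
gaussians = assemble_gaussians(
    points=[(1.0, 0.0, 0.0)],
    offsets=[(0.0, 0.5, 0.0)],
    colors=[(0.8, 0.6, 0.5)],
    scales=[0.01],
)
posed = forward_skin(gaussians[0].position, weights=[0.5, 0.5], bones=bones)
```

In this toy setup the canonical centre is the sampled point plus its offset, and the posed centre is the equal-weight blend of the two bone-transformed positions. In the described method the offsets, colors, and scales would come from the Gaussian parameter decoder rather than being written by hand.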