iHuman: Instant Animatable Digital Humans From Monocular Videos

Pramish Paudel, Anubhav Khanal, Ajad Chhatkuli, Danda Pani Paudel, Jyoti Tandukar

TL;DR

The paper tackles the challenge of instantly creating animatable 3D digital humans from monocular videos. It introduces iHuman, a Gaussian Splat-based framework bound to a SMPL-derived mesh, with vertex displacements, surface-bound Gaussians, and normal-guided supervision to ensure accurate geometry under pose changes. Key contributions include mesh-binding of Gaussians, a displacement encoding pipeline, and explicit normal map supervision, achieving state-of-the-art results with an order-of-magnitude reduction in training time. The approach enables scalable, personalized avatar creation suitable for VR/AR and telepresence, offering fast, high-quality novel-view synthesis from monocular input. The work demonstrates substantial practical impact by reducing compute and time barriers for digitizing animatable humans while preserving fine geometric details.

Abstract

Personalized 3D avatars require an animatable representation of digital humans. Doing so instantly from monocular videos offers scalability to a broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method utilizes the efficiency of Gaussian splatting to model both 3D geometry and appearance. However, we observed that naively optimizing Gaussian splats results in inaccurate geometry, thereby leading to poor animations. This work achieves and illustrates the need for accurate 3D mesh-type modelling of the human body for animatable digitization through Gaussian splats. This is achieved by developing a novel pipeline that benefits from three key aspects: (a) implicit modelling of the surface's displacements and the color's spherical harmonics; (b) binding of 3D Gaussians to the respective triangular faces of the body template; (c) a novel technique to render normals followed by their auxiliary supervision. Our exhaustive experiments on three different benchmark datasets demonstrate the state-of-the-art results of our method in limited-time settings. In fact, our method is faster by an order of magnitude (in terms of training time) than its closest competitor. At the same time, we achieve superior rendering and 3D reconstruction performance under changes of pose.

Paper Structure (25 sections, 16 equations, 10 figures, 4 tables)


Figures (10)

  • Figure 1: Instant Digital Humans. Our method provides a detailed and accurate 3D mesh and renderable Gaussian splats in 15 seconds of training time, from a monocular video. In contrast, the existing methods Anim-NeRF [DBLP:conf/cvpr/PengZXWSBZ21] and GART [lei_gart_2023] provide lower-quality meshes and rendered images, even after using more training time and compute. Input video (left) and rendered poses around the recovered meshes.
  • Figure 2: Training time (mins $\downarrow$) vs. rendering ($\downarrow$) comparison for different methods.
  • Figure 3: Our method represents the human body in canonical space with Gaussians parameterized by 3D Gaussian centers $x$, rotations $q$, scales $S$, opacity $\alpha_o$, colors $SH$, skinning weights $w$, and an associated parent triangle $i_x$. It takes the body pose $\theta_t$ of the $t^{\text{th}}$ frame as input and applies forward linear blend skinning to transform $v'$ to the posed-space vertices $v_p$. We compute the Gaussian center $x$ from the posed-space vertices $v_p$ of $i_x$. The normal of the parent triangle $i_x$ is encoded as $SH_{\hat{n}}$ and rasterized to obtain the normal map $I_{\hat{n}}$. We then apply a photometric loss and a normal map loss to recover both geometry and color. The GT normal map ($\bar{I}_{\hat{n}}$) is obtained from the monocular RGB image ($I_n$) using the pix2pixHD network [wang2018high].
  • Figure 4: Qualitative results: we obtain a fully rigged, colored mesh using iHuman, since the reconstructed mesh shares the same topology as the SMPL body model. The obtained meshes are watertight and accurate.
  • Figure 5: Results on the MGN dataset [bhatnagar2019multi]: we compare the ground-truth shapes (green) and predictions (yellow), along with the corresponding error heatmaps with respect to the ground-truth shapes (blue represents errors $\leq 1$ cm and red represents errors $\geq 3$ cm).
  • ...and 5 more figures
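
The geometric part of the pipeline in the Figure 3 caption (forward linear blend skinning, then deriving each Gaussian's center and normal from its parent triangle) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, array layouts, and the choice to place each Gaussian at the centroid of its parent triangle are assumptions made for illustration; the caption only states that the center is computed from the posed vertices of $i_x$. Spherical-harmonics encoding of the normal and rasterization are omitted.

```python
import numpy as np


def linear_blend_skinning(verts, weights, transforms):
    """Pose canonical vertices with blended per-joint transforms.

    verts:      (V, 3) canonical vertices (v' in the caption)
    weights:    (V, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) homogeneous joint transforms for one frame
    Returns the (V, 3) posed vertices (v_p in the caption).
    """
    homo = np.concatenate([verts, np.ones((len(verts), 1))], axis=1)  # (V, 4)
    blended = np.einsum("vj,jab->vab", weights, transforms)           # (V, 4, 4)
    posed = np.einsum("vab,vb->va", blended, homo)                    # (V, 4)
    return posed[:, :3]


def bind_gaussians(posed_verts, faces, parent_face):
    """Derive Gaussian centers and unit normals from parent triangles.

    posed_verts: (V, 3) posed vertices
    faces:       (F, 3) vertex indices per triangle
    parent_face: (G,) parent triangle index per Gaussian (i_x in the caption)
    Assumption: each Gaussian sits at its parent triangle's centroid.
    Degenerate (zero-area) triangles are not handled in this sketch.
    """
    tri = posed_verts[faces[parent_face]]                   # (G, 3, 3)
    centers = tri.mean(axis=1)                              # (G, 3)
    n = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normals = n / np.linalg.norm(n, axis=1, keepdims=True)  # (G, 3)
    return centers, normals


# Toy example: one triangle in the z = 0 plane, fully attached to joint 0,
# which translates by +2 along z. Joint 1 is the identity and has zero weight.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
weights = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
transforms = np.stack([np.eye(4), np.eye(4)])
transforms[0, :3, 3] = [0.0, 0.0, 2.0]

posed = linear_blend_skinning(verts, weights, transforms)
centers, normals = bind_gaussians(posed, faces, np.array([0]))
```

Because the centers and normals are functions of the posed mesh vertices, any loss on the rendered image or normal map would propagate to the mesh itself in a differentiable framework, which is consistent with the caption's description of recovering geometry and color jointly.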