GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh

Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G. Schwing, Shenlong Wang

TL;DR

GoMAvatar tackles monocular 4D human avatar creation by introducing Gaussians-on-Mesh (GoM), an explicit representation that couples triangle-attached Gaussians with a skeleton-driven deformable mesh for real-time rendering. Rendering splits into a pseudo albedo from Gaussian splatting and a pseudo shading from a mesh-derived normal map, enabling view-dependent effects within a rasterization-friendly pipeline; articulation uses forward skinning with a non-rigid deformer and a pose-refinement module to correct SMPL estimates. Key contributions include the GoM representation, a shading module for view dependency, GoM subdivision for geometry refinement, and end-to-end training from monocular video without extra data, achieving up to $43$ FPS and $3.63$ MB per subject with competitive PSNR/SSIM/LPIPS metrics. The approach enables practical, high-quality, real-time animatable avatars from a single video and integrates well with standard graphics engines, highlighting its potential for AR/VR, entertainment, and simulation workflows. Future work may address topology changes and enhanced robustness to diverse clothing and occlusions, further broadening real-world applicability.
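The forward skinning step mentioned above can be illustrated with standard linear blend skinning (LBS). This is a minimal sketch, not the authors' implementation: the summary does not state which skinning formulation is used, so LBS is an assumption here, and the function name `linear_blend_skinning`, the array shapes, and the toy two-bone setup are all hypothetical. The non-rigid deformer and the pose-refinement module are omitted.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, rotations, translations):
    """Pose canonical vertices with linear blend skinning (assumed formulation).

    vertices:     (V, 3) canonical-space mesh vertices
    weights:      (V, J) skinning weights, each row summing to 1
    rotations:    (J, 3, 3) per-bone rotation matrices
    translations: (J, 3) per-bone translations
    Returns the (V, 3) posed vertices.
    """
    # Apply every bone's rigid transform to every vertex: shape (J, V, 3).
    per_bone = np.einsum("jab,vb->jva", rotations, vertices)
    per_bone = per_bone + translations[:, None, :]
    # Blend the per-bone results with the skinning weights: shape (V, 3).
    return np.einsum("vj,jva->va", weights, per_bone)

# Toy example: bone 0 is the identity, bone 1 rotates 90 degrees about z
# and then shifts by one unit along z.
rotations = np.stack([
    np.eye(3),
    np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
])
translations = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
vertices = np.array([[1.0, 0.0, 0.0]] * 3)
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

posed = linear_blend_skinning(vertices, weights, rotations, translations)
```

In this toy setup, a vertex bound entirely to one bone follows that bone rigidly, while the vertex with equal weights lands halfway between the two rigid results. Because the mesh vertices are what get skinned, anything attached to the triangles moves with them.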

Abstract

We introduce GoMAvatar, a novel approach for real-time, memory-efficient, high-quality animatable human modeling. GoMAvatar takes as input a single monocular video to create a digital avatar capable of re-articulation in new poses and real-time rendering from novel viewpoints, while seamlessly integrating with rasterization-based graphics pipelines. Central to our method is the Gaussians-on-Mesh representation, a hybrid 3D model combining rendering quality and speed of Gaussian splatting with geometry modeling and compatibility of deformable meshes. We assess GoMAvatar on ZJU-MoCap data and various YouTube videos. GoMAvatar matches or surpasses current monocular human modeling algorithms in rendering quality and significantly outperforms them in computational efficiency (43 FPS) while being memory-efficient (3.63 MB per subject).

Paper Structure (24 sections, 22 equations, 14 figures, 9 tables)

Figures (14)

  • Figure 1: GoMAvatar takes a monocular RGB video (left) as input to establish an explicit and accurate 4D representation of a dynamic human. It can render efficiently at novel views and poses with state-of-the-art quality. Additionally, it is extremely compact (3.63 MB per subject), efficient (43 FPS), and seamlessly compatible with graphics pipelines such as OpenGL.
  • Figure 2: Our approach is simultaneously faster (represented by $x$ coordinates of circle centers, smaller is better), more memory-efficient (represented by circle size, smaller is better), and higher in rendering quality (represented by $y$ coordinates of circle centers, higher is better). The horizontal brown line denotes our PSNR.
  • Figure 3: Gaussians-on-Mesh (GoM). We learn Gaussians in the local coordinates of each triangle and transform them to world coordinates based on the triangle's shape. We initialize the rotation $r_{\theta,j} \in \mathfrak{so}(3)$ to zeros and the scale $s_{\theta, j} \in \mathbb{R}^3$ to ones, so that we start with a Gaussian that is thin along the normal axis of the triangle. Meanwhile, the projection of the ellipsoid $\{x: (x-\mu_j)^T \Sigma_j^{-1}(x-\mu_j)=1\}$ onto the triangle recovers the Steiner ellipse. See Sec. \ref{sec:pointrep} and the appendix for details.
  • Figure 4: Qualitative comparison to state-of-the-art methods. In each pair, we render the RGB image and the normal map. The normal map is rendered from the extracted mesh. We show that our approach can produce realistic details in both rendered images and geometry, while other approaches struggle to generate a smooth mesh.
  • Figure 5: Qualitative results on YouTube videos. The first image is the reference image. We compare novel view synthesis in the first row and novel pose synthesis in the second row.
  • ...and 9 more figures
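
The geometric claim in the Figure 3 caption can be checked numerically with a small sketch. The construction below is an assumption for illustration, not the paper's parameterization: it reads "Steiner ellipse" as the Steiner circumellipse (the ellipse centered at the centroid that passes through all three vertices), builds the in-plane covariance directly from the vertices, and adds a small variance along the normal. The names `triangle_gaussian`, `mahalanobis_sq`, and `normal_std` are hypothetical.

```python
import numpy as np

def triangle_gaussian(v0, v1, v2, normal_std=1e-3):
    """Return (mean, covariance) of a Gaussian attached to one triangle.

    Illustrative construction (an assumption, not the paper's formula):
    the in-plane part is twice the covariance of the three vertices, so the
    level set (x - mu)^T Sigma^{-1} (x - mu) = 1 passes through the vertices.
    The normal direction gets variance normal_std**2, keeping the Gaussian
    thin. Degenerate (zero-area) triangles are not handled.
    """
    verts = np.stack([v0, v1, v2]).astype(np.float64)
    mu = verts.mean(axis=0)                      # triangle centroid
    d = verts - mu                               # rows are v_i - mu
    in_plane = (2.0 / 3.0) * (d.T @ d)           # 2 * vertex covariance
    n = np.cross(verts[1] - verts[0], verts[2] - verts[0])
    n = n / np.linalg.norm(n)                    # unit triangle normal
    sigma = in_plane + normal_std**2 * np.outer(n, n)
    return mu, sigma, n

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(sigma, diff))

tri = [
    np.array([0.0, 0.0, 0.0]),
    np.array([2.0, 0.0, 0.0]),
    np.array([0.5, 1.0, 0.3]),
]
mu, sigma, n = triangle_gaussian(*tri)

# Each vertex should lie on the level-1 ellipsoid, i.e. on the ellipse
# that the ellipsoid cuts out of the triangle's plane.
vertex_levels = [mahalanobis_sq(v, mu, sigma) for v in tri]
# Variance along the normal should equal normal_std**2.
normal_variance = float(n @ sigma @ n)
```

Under these assumptions, every entry of `vertex_levels` equals 1 up to floating-point error, so the in-plane cross-section of the ellipsoid is the ellipse through the three vertices, and `normal_variance` stays at `normal_std**2`. Because the covariance is a function of the vertex positions, deforming the triangle deforms the Gaussian with it.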