GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh
Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G. Schwing, Shenlong Wang
TL;DR
GoMAvatar creates an animatable 4D human avatar from a single monocular video. It introduces Gaussians-on-Mesh (GoM), an explicit representation that attaches Gaussians to the triangles of a skeleton-driven deformable mesh and supports real-time rendering. Rendering is split into a pseudo albedo from Gaussian splatting and a pseudo shading from a mesh-derived normal map, which enables view-dependent effects within a rasterization-friendly pipeline. Articulation uses forward skinning with a non-rigid deformer, and a pose-refinement module corrects the SMPL estimates. The key contributions are the GoM representation, a shading module for view dependency, GoM subdivision for geometry refinement, and end-to-end training from monocular video without extra data. The method reaches up to 43 FPS at 3.63 MB per subject with competitive PSNR, SSIM, and LPIPS scores. It yields practical, high-quality, real-time animatable avatars from a single video and integrates well with standard graphics engines, which makes it a candidate for AR/VR, entertainment, and simulation workflows. Future work may address topology changes and robustness to diverse clothing and occlusions.
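The pipeline summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`skin_vertices`, `triangle_frames`, `compose_image`) are hypothetical, the forward skinning is plain linear blend skinning without the non-rigid deformer or pose refinement, anchoring one Gaussian at each triangle centroid is an assumption, and so is combining pseudo albedo and pseudo shading as a per-pixel product, which the summary does not state explicitly.

```python
import numpy as np


def skin_vertices(rest_vertices, skin_weights, bone_transforms):
    """Linear blend skinning: move rest-pose vertices with weighted bone transforms.

    rest_vertices:   (V, 3) vertex positions in the rest pose
    skin_weights:    (V, J) per-vertex weights, each row summing to 1
    bone_transforms: (J, 4, 4) rigid transforms from rest pose to target pose
    """
    num_vertices = rest_vertices.shape[0]
    homogeneous = np.concatenate([rest_vertices, np.ones((num_vertices, 1))], axis=1)
    # Blend the bone transforms per vertex, then apply the blended transform.
    blended = np.einsum("vj,jab->vab", skin_weights, bone_transforms)
    posed = np.einsum("vab,vb->va", blended, homogeneous)
    return posed[:, :3]


def triangle_frames(vertices, faces):
    """Return per-triangle centroids and unit normals of a posed mesh.

    Assumption: a triangle-attached Gaussian is anchored at the centroid.
    Degenerate (zero-area) triangles are not handled.
    """
    corners = vertices[faces]  # (F, 3, 3)
    centroids = corners.mean(axis=1)
    normals = np.cross(corners[:, 1] - corners[:, 0], corners[:, 2] - corners[:, 0])
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    return centroids, normals


def compose_image(pseudo_albedo, pseudo_shading):
    """Assumed composition: per-pixel product of (H, W, 3) albedo and (H, W, 1) shading."""
    return np.clip(pseudo_albedo * pseudo_shading, 0.0, 1.0)


# Toy example: one triangle, two bones (identity and a +1 shift along x).
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, 0, 3] = 1.0  # second bone translates by +1 along x
faces = np.array([[0, 1, 2]])

posed = skin_vertices(rest, weights, bones)
centroids, normals = triangle_frames(posed, faces)
image = compose_image(np.full((2, 2, 3), 0.8), np.full((2, 2, 1), 0.5))
```

Because the Gaussian anchors are recomputed from the posed vertices, they follow the mesh under any new pose, which is the property that lets a mesh-attached representation be re-articulated through the skeleton alone.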
Abstract
We introduce GoMAvatar, a novel approach for real-time, memory-efficient, high-quality animatable human modeling. GoMAvatar takes as input a single monocular video to create a digital avatar capable of re-articulation in new poses and real-time rendering from novel viewpoints, while seamlessly integrating with rasterization-based graphics pipelines. Central to our method is the Gaussians-on-Mesh representation, a hybrid 3D model combining the rendering quality and speed of Gaussian splatting with the geometry modeling and compatibility of deformable meshes. We assess GoMAvatar on ZJU-MoCap data and various YouTube videos. GoMAvatar matches or surpasses current monocular human modeling algorithms in rendering quality and significantly outperforms them in computational efficiency (43 FPS) while being memory-efficient (3.63 MB per subject).
