MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos

Yushuo Chen, Zerong Zheng, Zhe Li, Chao Xu, Yebin Liu

TL;DR

MeshAvatar introduces a hybrid explicit-implicit representation for learning triangular human avatars from multi-view videos, combining an explicit skinned mesh extracted from an implicit SDF via differentiable marching tetrahedra with a pose-dependent implicit material field. High-frequency geometry details are generated by a 2D UNet acting on front/back position maps, and pseudo normal supervision further refines surface quality; materials are inferred with a pose-conditioned feature map and rendered through a differentiable Monte-Carlo path tracer under a low-frequency environment map. The method enables physics-based rendering, relighting, and editing within a traditional graphics pipeline, achieving state-of-the-art geometry and material decomposition on ActorsHQ and AvatarReX, while maintaining end-to-end trainability. Ablation studies validate the contributions of PBR-based rendering and normal supervision, though limitations remain for pose-dependent material realism and loose garment dynamics. Overall, MeshAvatar provides a scalable, editable, and physically grounded framework for dynamic human avatars with strong potential for practical graphics and AR/VR applications.
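To make the mesh-extraction step more concrete, the sketch below shows the core operation that marching-tetrahedra style methods apply on each tetrahedron edge whose endpoints have SDF values of opposite sign: the surface vertex is placed at the linearly interpolated zero crossing. Because the vertex position is a smooth function of the two SDF values, gradients can flow back to the implicit field. This is a minimal illustration in pure Python, not the authors' implementation; the names `edge_zero_crossing` and `sphere_sdf` are hypothetical and the unit sphere merely stands in for a learned SDF.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def edge_zero_crossing(p_a: Vec3, p_b: Vec3, s_a: float, s_b: float) -> Vec3:
    """Return the point on edge (p_a, p_b) where a linear SDF model is zero.

    s_a and s_b are the SDF values at p_a and p_b and must have opposite signs.
    """
    if s_a * s_b >= 0.0:
        raise ValueError("edge does not cross the surface")
    # Fraction of the way from p_a to p_b at which the interpolated SDF vanishes.
    t = s_a / (s_a - s_b)
    return tuple(a + t * (b - a) for a, b in zip(p_a, p_b))


def sphere_sdf(p: Vec3, radius: float = 1.0) -> float:
    """Signed distance to a sphere at the origin (stand-in for a learned SDF)."""
    return (p[0] ** 2 + p[1] ** 2 + p[2] ** 2) ** 0.5 - radius


# One grid edge with an endpoint inside the sphere and one outside.
p_in = (0.5, 0.0, 0.0)
p_out = (2.0, 0.0, 0.0)
surface_point = edge_zero_crossing(p_in, p_out, sphere_sdf(p_in), sphere_sdf(p_out))
```

A full extraction repeats this over every sign-changing edge of a tetrahedral grid and connects the resulting vertices into triangles using a per-tetrahedron lookup table.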

Abstract

We present a novel pipeline for learning high-quality triangular human avatars from multi-view videos. Recent methods for avatar learning are typically based on neural radiance fields (NeRF), which are not compatible with the traditional graphics pipeline and pose great challenges for operations such as editing or synthesis under different environments. To overcome these limitations, our method represents the avatar with an explicit triangular mesh extracted from an implicit SDF, complemented by an implicit material field conditioned on given poses. Leveraging this triangular avatar representation, we incorporate physics-based rendering to accurately decompose geometry and texture. To enhance both geometric and appearance details, we further employ a 2D UNet as the network backbone and introduce pseudo normal ground truth as additional supervision. Experiments show that our method can learn triangular avatars with high-quality geometry reconstruction and plausible material decomposition, inherently supporting editing, manipulation, and relighting operations.
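An explicit mesh that follows a given pose is typically animated by skinning. The sketch below illustrates linear blend skinning, a standard technique in which each vertex is moved by a weighted average of per-bone rigid transforms. The text here does not say which skinning formulation the method uses, so this is an assumption for illustration only; `rotation_z`, `apply_rigid`, and `blend_skin` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

Matrix3 = List[List[float]]
Rigid = Tuple[Matrix3, List[float]]  # (rotation, translation)


def rotation_z(angle: float) -> Matrix3:
    """3x3 rotation about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_rigid(rot: Matrix3, trans: Sequence[float], v: Sequence[float]) -> List[float]:
    """Apply a rotation followed by a translation to a 3D point."""
    return [sum(rot[i][j] * v[j] for j in range(3)) + trans[i] for i in range(3)]


def blend_skin(vertex: Sequence[float], weights: Sequence[float],
               bones: Sequence[Rigid]) -> List[float]:
    """Linear blend skinning: weighted sum of the vertex moved by each bone."""
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    out = [0.0, 0.0, 0.0]
    for w, (rot, trans) in zip(weights, bones):
        moved = apply_rigid(rot, trans, vertex)
        for i in range(3):
            out[i] += w * moved[i]
    return out


# A vertex influenced equally by a resting bone and a bone rotated by 90 degrees.
identity: Rigid = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
bent: Rigid = (rotation_z(math.pi / 2), [0.0, 0.0, 0.0])
skinned = blend_skin([1.0, 0.0, 0.0], [0.5, 0.5], [identity, bent])
```

Because the skinned position is linear in the canonical vertex position, gradients from image-space losses on the posed mesh can propagate back to the canonical geometry.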

Paper Structure (21 sections, 13 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Example results from our method. Given multi-view videos of a specific subject, our method learns a triangular avatar of that subject with geometry reconstruction and intrinsic material decomposition. After training, the avatar not only supports novel pose synthesis and relighting, but also enables texture editing and material manipulation. In this example, we make the arm metallic and edit the texture of the T-shirt.
  • Figure 2: Illustration of our method. Our pipeline learns a hybrid human avatar represented in the form of (a) an explicit skinned mesh and (b) implicit pose-dependent material fields. Such a representation inherently supports (c) physics-based ray tracing and can be trained in an end-to-end manner using (d) normal estimation as an additional supervision signal.
  • Figure 3: Animation and relighting results. The actors in the same row are driven with the same pose, while the actors in the same column are relit under the same global illumination.
  • Figure 4: More reconstruction results, including reconstructed geometry, albedo color, material roughness, and light probes.
  • Figure 5: Qualitative comparisons on training frame reconstructions. Our method can reconstruct fine-grained dynamic human geometry.
  • ...and 7 more figures