HR Human: Modeling Human Avatars with Triangular Mesh and High-Resolution Textures from Videos

Qifeng Chen, Rengan Xie, Kai Huang, Qi Wang, Wenting Zheng, Rong Li, Yuchi Huo

TL;DR

HR Human addresses the gap between neural implicit avatars and traditional graphics pipelines by producing editable triangular meshes and high-resolution PBR textures from monocular video. The approach fuses temporal information into a deformable neural surface defined by the SDF $f_s$, generates pseudo multi-view supervision, and jointly optimizes geometry, materials, and lighting via differentiable rendering. It further introduces an unbiased mesh refinement step and distills high-frequency texture details through a latent diffusion model, enabling $2048^2$ textures that can be relit and edited in common engines. The resulting avatars demonstrate higher fidelity in geometry and texture compared with prior methods and are readily usable in real-time or offline graphics workflows, with practical implications for AR/VR, film production, and digital humans.

Abstract

Recently, implicit neural representations have been widely used to generate animatable human avatars. However, the materials and geometry of those representations are coupled in the neural network and hard to edit, which hinders their application in traditional graphics engines. We present a framework for acquiring human avatars with high-resolution physically-based material textures and triangular meshes from monocular video. Our method introduces a novel information fusion strategy that combines information across the monocular video and synthesizes virtual multi-view images to tackle the sparsity of the input views. We reconstruct humans as deformable neural implicit surfaces and extract a triangle mesh in a well-behaved pose as the initial mesh for the next stage. In addition, we introduce an approach to correct the bias in the boundary and size of the extracted coarse mesh. Finally, we adapt the prior knowledge of a latent diffusion model for multi-view super-resolution to distill the decomposed texture. Experiments show that our approach achieves higher fidelity than previous representations, and its explicit output supports deployment on common renderers.

Paper Structure (21 sections, 26 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Overview of the HR Human pipeline, which takes video frames as input to reconstruct explicit avatars with a triangular mesh and PBR textures. The pipeline includes deformable neural representations (used to extract volume surfaces and enhance the sparse input views), explicit representations (texture and geometry are jointly optimized), and super-resolution texture modules (introduced to generate high-resolution textures).
  • Figure 2: Qualitative comparison with baseline methods, including albedo, geometric normals, and rendered images.
  • Figure 3: Qualitative comparison of the effectiveness of the information fusion strategy and unbiased optimization. The highlights produced by overlaying the meshes indicate geometric bias.
  • Figure 4: Qualitative comparison of the effectiveness of super-resolution. From left to right: the ground truth (GT), and renderings with textures optimized at $512^2$ resolution, textures optimized at $2048^2$ resolution, and the implicit neural field.
  • Figure 5: Qualitative comparison of the effect of the number of synthesized views used in the fusion strategy. From left to right, the number of training views increases.
  • ...and 6 more figures