Semantic Human Mesh Reconstruction with Textures

Xiaoyu Zhan, Jianxin Yang, Yuanqi Li, Jie Guo, Yanwen Guo, Wenping Wang

TL;DR

SHERT introduces a pipeline to reconstruct fully textured semantic human avatars from a detailed surface or monocular image by leveraging a subdivided SMPL-X model, semantic- and normal-based sampling, and self-supervised completion and refinement. A UV-domain completion network and a diffusion-based texture model enable high-fidelity textures with stable UV unwrapping and easy animation of the face, body, and hands. SHERT demonstrates robustness to incomplete or inaccurate inputs and achieves text- and image-driven texture generation, including facial details via face substitutions. Quantitative and qualitative results show SHERT outperforms state-of-the-art monocular reconstruction methods on standard datasets and in-the-wild images, suggesting practical utility for industrial avatars and virtual production.

Abstract

The field of 3D detailed human mesh reconstruction has made significant progress in recent years. However, current methods still face challenges when used in industrial applications due to unstable results, low-quality meshes, and a lack of UV unwrapping and skinning weights. In this paper, we present SHERT, a novel pipeline that can reconstruct semantic human meshes with textures and high-precision details. SHERT applies semantic- and normal-based sampling between the detailed surface (e.g. mesh and SDF) and the corresponding SMPL-X model to obtain a partially sampled semantic mesh and then generates the complete semantic mesh by our specifically designed self-supervised completion and refinement networks. Using the complete semantic mesh as a basis, we employ a texture diffusion model to create human textures that are driven by both images and texts. Our reconstructed meshes have stable UV unwrapping, high-quality triangle meshes, and consistent semantic information. The given SMPL-X model provides semantic information and shape priors, allowing SHERT to perform well even with incorrect and incomplete inputs. The semantic information also makes it easy to substitute and animate different body parts such as the face, body, and hands. Quantitative and qualitative experiments demonstrate that SHERT is capable of producing high-fidelity and robust semantic meshes that outperform state-of-the-art methods.

Paper Structure

This paper contains 26 sections, 12 equations, 31 figures, 2 tables, 1 algorithm.

Figures (31)

  • Figure 1: Semantic Human mEsh Reconstruction with Textures (SHERT): (a) Given a target surface and the corresponding semantic guider, (b) SHERT reconstructs a detailed semantic model, which has stable UV unwrapping and skinning weights with high-quality triangle meshes. (c) It can either sample a texture map from the target surface or generate from the text prompts. (d) Based on our semantic representation, SHERT allows for high-precision facial reconstruction and animation of the body, face, and hands. (e) Moreover, SHERT is capable of inferring a fully textured avatar from a monocular image.
  • Figure 2: Overview of SHERT for monocular image reconstruction. Given an RGB image $\mathcal{I}$, SHERT first infers the detailed mesh $\mathcal{M}_t$ and the corresponding sub-SMPLX model $\mathcal{M}_x$. It then applies semantic- and normal-based sampling (SNS) to obtain the partial semantic mesh $\mathcal{M}_s$ (Sec. \ref{sec:m2}). The Completion Net infers $\mathcal{M}_c$ by filling the UV holes in $\mathcal{M}_s$ (Sec. \ref{sec:m3}). Image and normal maps are then used to generate $\mathcal{M}_r$, which contains sharper geometric details (Sec. \ref{sec:m4}). Finally, SHERT uses a diffusion model for text-driven texture inpainting and generation (Sec. \ref{sec:m5}).
  • Figure 3: Explicit and implicit sampling. (a) SNS shoots a ray from the starting vertex along the vertex normal to locate the intersection point with the target surface. (b) SNS takes samples at fixed intervals along the vertex normal ray and searches for the point that is closest to the zero isosurface.
  • Figure 4: SNS in the implicit field. We compare the performance of SNS and the Marching Cubes algorithm \cite{lorensen1998marching} in the implicit field predicted by the multi-view PIFu \cite{pifu:iccv:2019}. The results show that our SNS and completion network correctly reconstruct the implicit field and obtain a smoother surface compared to Marching Cubes.
  • Figure 5: Completion network. The completion network transforms the result of SNS into canonical space and predicts the holes in the UV domain.
  • ...and 26 more figures
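
The implicit sampling described in the Figure 3 caption (panel b) can be illustrated with a small sketch. This is not the authors' implementation: the toy sphere SDF, the function names, and the `max_dist` and `step` parameters are all assumptions chosen for illustration. The sketch only walks outward along the vertex normal at fixed intervals and keeps the sample whose signed distance is closest to zero.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Toy target surface (assumption): signed distance to a sphere at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def sample_along_normal(vertex, normal, sdf, max_dist=0.5, step=0.01):
    """Sample the ray vertex + t * normal at fixed intervals of `step`
    for t in [0, max_dist] and return the sample closest to the zero
    isosurface of `sdf`, together with its ray parameter t.

    `max_dist` and `step` are hypothetical parameters, not values from the paper.
    """
    # Normalise the direction so that t is a distance along the ray.
    length = math.sqrt(sum(c * c for c in normal))
    direction = [c / length for c in normal]

    best_t, best_abs = 0.0, float("inf")
    num_steps = int(round(max_dist / step))
    for i in range(num_steps + 1):
        t = i * step
        point = [v + t * d for v, d in zip(vertex, direction)]
        value = abs(sdf(point))
        if value < best_abs:
            best_t, best_abs = t, value

    best_point = [v + best_t * d for v, d in zip(vertex, direction)]
    return best_point, best_t


# A vertex slightly inside the sphere, with its normal pointing outward.
point, t = sample_along_normal([0.8, 0.0, 0.0], [1.0, 0.0, 0.0], sphere_sdf)
```

With a fixed step the located point is only accurate to roughly one step length; a finer step or a local refinement between neighbouring samples would tighten it, at the cost of more SDF evaluations.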