SemanticHuman-HD: High-Resolution Semantic Disentangled 3D Human Generation

Peng Zheng, Tao Liu, Zili Yi, Rui Ma

TL;DR

SemanticHuman-HD is the first method to achieve semantic disentangled human image synthesis and is also the first method to achieve 3D-aware image synthesis at $1024^2$ resolution, benefiting from the proposed 3D-aware super-resolution module.

Abstract

With the development of neural radiance fields and generative models, numerous methods have been proposed for learning 3D human generation from 2D images. These methods allow control over the pose of the generated 3D human and enable rendering from different viewpoints. However, none of these methods explore semantic disentanglement in human image synthesis, i.e., they cannot disentangle the generation of different semantic parts, such as the body, tops, and bottoms. Furthermore, existing methods are limited to synthesizing images at $512^2$ resolution due to the high computational cost of neural radiance fields. To address these limitations, we introduce SemanticHuman-HD, the first method to achieve semantic disentangled human image synthesis. Notably, SemanticHuman-HD is also the first method to achieve 3D-aware image synthesis at $1024^2$ resolution, benefiting from our proposed 3D-aware super-resolution module. By leveraging depth maps and semantic masks as guidance for the 3D-aware super-resolution, we significantly reduce the number of sampling points during volume rendering, thereby reducing the computational cost. Our comparative experiments demonstrate the superiority of our method. The effectiveness of each proposed component is also verified through ablation studies. Moreover, our method opens up exciting possibilities for various applications, including 3D garment generation, semantic-aware image synthesis, controllable image synthesis, and out-of-domain image synthesis.
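To make the cost argument concrete, the sketch below contrasts uniform sampling along a ray with sampling restricted to a narrow band around a coarse depth estimate. It is only an illustration under stated assumptions: the helper names, the band half-width, and the figure of 8 guided samples per pixel are hypothetical and are not taken from the paper; the 72 samples per pixel is the number quoted for the $256^2$ stage in the Figure 2 caption.

```python
def uniform_samples(near, far, n):
    """Return n evenly spaced depths in [near, far] (midpoints of n bins)."""
    step = (far - near) / n
    return [near + (i + 0.5) * step for i in range(n)]


def depth_guided_samples(coarse_depth, half_width, n, near, far):
    """Return n depths in a narrow band centred on a coarse depth estimate.

    The band is clamped to [near, far] so samples never leave the ray bounds.
    half_width is a hypothetical parameter, not a value from the paper.
    """
    lo = max(near, coarse_depth - half_width)
    hi = min(far, coarse_depth + half_width)
    return uniform_samples(lo, hi, n)


def total_samples(resolution, per_pixel):
    """Number of radiance-field queries for one square image."""
    return resolution * resolution * per_pixel


# 72 samples per pixel is quoted for the 256^2 stage; 8 is an assumed value.
dense_queries = total_samples(1024, 72)
guided_queries = total_samples(1024, 8)
band = depth_guided_samples(2.0, 0.1, 4, near=0.5, far=3.5)
```

Under these assumed numbers, rendering a $1024^2$ image with 72 samples per pixel would need 75,497,472 queries, against 8,388,608 with 8 guided samples, a ninefold reduction. The actual saving depends on how many samples the method keeps per pixel, which this summary does not state.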

Paper Structure (49 sections, 10 equations, 14 figures)


Figures (14)

  • Figure 1: (a) Semantic-aware virtual try-on. Given a real image, we first employ GAN inversion to obtain its semantic latent code. Subsequently, we replace the top and bottom garments by manipulating the semantic latent code. Here, the top is randomly generated by our model, and the bottom is disentangled from another GAN inversion result. (b) Controllable image synthesis. Our method allows for generating the same person in different poses as well as rendering them from different viewpoints.
  • Figure 2: Pipeline of SemanticHuman-HD. In Stage 1, given random noise $z$, the Semantic Mapper maps it to $K$ latent codes $W_k$, conditioned on human pose $P$ and semantic label $L_s$. Each local generator $G_k$ then maps $W_k$ into a tri-plane representation $T_k^{256}$. For each pixel in the synthesized image, we sample 72 points in posed space, which are subsequently deformed into canonical space using the deformer. These sampled points allow us to interpolate within the tri-plane representation, obtaining color and density information for each point. Finally, the Semantic Renderer renders the image, depth map, semantic mask, and normal map at $256^2$ resolution. In Stage 2, we employ a convolutional network to obtain high-resolution tri-plane representations, denoted as $T_k^{1024}$. To enhance efficiency, we significantly reduce the number of sampling points per pixel using semantic and depth-guided sampling. Ultimately, we render the image and normal map at $1024^2$ resolution.
  • Figure 3: Quantitative comparisons. $^\star$ denotes the use of a super-resolution module that is not 3D-aware. Some results are marked with $^\divideontimes$, indicating that these results are quoted from other papers because the authors did not release their training code or pre-trained model. A: Local editing. B: Semantic-aware synthesis. C: Semantic disentangled synthesis. D: 3D Garment generation.
  • Figure 4: Qualitative comparison. To better assess the detailed quality of the generated results, we zoom in on the face and clothing areas in the synthesized images. Notably, the image synthesized by our method is at $1024^2$ resolution, whereas the results from other methods are only at $512^2$ resolution.
  • Figure 5: Comparison for local editing. For each edited image, we zoom in on key areas to demonstrate the editing capabilities.
  • ...and 9 more figures
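
The Stage 1 rendering step described in the Figure 2 caption, where per-part color and density along a ray are turned into an image, a depth map, and a semantic mask, can be illustrated with a small single-ray sketch. The captions do not give the Semantic Renderer's compositing formula, so this assumes the standard density-weighted composition used in compositional radiance fields: part densities are summed, and each part's share of the total density decides how much of the ray's weight it receives. The function name `render_ray` and its argument layout are hypothetical.

```python
import math


def render_ray(depths, part_densities, part_colors):
    """Composite K semantic parts along one ray (assumed formulation).

    depths: sorted sample depths along the ray (length N)
    part_densities: N lists of K non-negative densities
    part_colors: N lists of K (r, g, b) tuples
    Returns (rgb, depth, mask); mask has one accumulated weight per part.
    """
    k = len(part_densities[0])
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    mask = [0.0] * k
    transmittance = 1.0
    for i, t in enumerate(depths):
        # Interval length; the last sample reuses the previous spacing.
        if i + 1 < len(depths):
            delta = depths[i + 1] - t
        else:
            delta = t - depths[i - 1] if i > 0 else 1.0
        sigma = sum(part_densities[i])
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        if sigma > 0.0:
            for j in range(k):
                # Share of this sample's weight that belongs to part j.
                share = part_densities[i][j] / sigma
                mask[j] += weight * share
                for c in range(3):
                    rgb[c] += weight * share * part_colors[i][j][c]
        depth += weight * t  # expected depth, not normalised by opacity
        transmittance *= 1.0 - alpha
    return rgb, depth, mask


# Two samples, two parts: an opaque red part in front of an opaque blue part.
rgb, depth, mask = render_ray(
    depths=[1.0, 2.0],
    part_densities=[[1e9, 0.0], [0.0, 1e9]],
    part_colors=[[(1, 0, 0), (0, 0, 1)], [(1, 0, 0), (0, 0, 1)]],
)
```

In this toy case the front part absorbs the whole ray, so the mask assigns the pixel to part 0 and the depth equals the first sample's depth. Because the mask is accumulated per part, zeroing one part's density removes it from the image without touching the others, which is the kind of behavior semantic disentanglement relies on.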