StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model

Peizhi Yan, Rabab K. Ward, Dan Wang, Qiang Tang, Shan Du

TL;DR

StyleMorpheus tackles the challenge of building a 3D-aware morphable face model trained on in-the-wild 2D images without explicit 3D geometry. It introduces a style-based auto-encoder that learns disentangled identity $z_{id}$, expression $z_{expr}$, texture $z_{tex}$, and lighting $z_{light}$ codes, mapped to $w$-spaces and decoded through a low-resolution NeRF-inspired, style-conditioned decoder with modulated layers to achieve real-time, photorealistic rendering. The approach combines a two-stage training regime with photometric, perceptual, segmentation, and code regularization losses, plus adversarial learning, and enables single-image reconstruction, style mixing, and facial-part color editing, outperforming prior 3D-aware parametric models in reconstruction quality. Real-time performance is demonstrated on consumer hardware, with applications in VR and interactive editing, while ablation analyses validate the necessity of adversarial training and encoder-based disentanglement. The work advances practical 3D-aware face modeling by removing the need for explicit 3D shapes and lab-collected data, enabling scalable, editable, and fast 3D face synthesis from in-the-wild imagery.

Abstract

For 3D face modeling, the recently developed 3D-aware neural rendering methods are able to render photorealistic face images with arbitrary viewing directions. The training of the parametric controllable 3D-aware face models, however, still relies on a large-scale dataset that is lab-collected. To address this issue, this paper introduces "StyleMorpheus", the first style-based neural 3D Morphable Face Model (3DMM) that is trained on in-the-wild images. It inherits 3DMM's disentangled controllability (over face identity, expression, and appearance) but without the need for accurately reconstructed explicit 3D shapes. StyleMorpheus employs an auto-encoder structure. The encoder aims at learning a representative disentangled parametric code space and the decoder improves the disentanglement using shape and appearance-related style codes in the different sub-modules of the network. Furthermore, we fine-tune the decoder through style-based generative adversarial learning to achieve photorealistic 3D rendering quality. The proposed style-based design enables StyleMorpheus to achieve state-of-the-art 3D-aware face reconstruction results, while also allowing disentangled control of the reconstructed face. Our model achieves real-time rendering speed, allowing its use in virtual reality applications. We also demonstrate the capability of the proposed style-based design in face editing applications such as style mixing and color editing. Project homepage: https://github.com/ubc-3d-vision-lab/StyleMorpheus.

Paper Structure

This paper contains 22 sections, 18 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: The proposed model architecture follows an auto-encoder structure. The encoder consists of an ArcFace-based face image encoder and four semantic heads that produce the identity, expression, texture, and lighting codes $z$ in $\mathcal{Z}$ space. Each semantic head is followed by a mapping network $M$ that maps the corresponding code to a style code $w$ in $\mathcal{W}$ space. The style codes in $\mathcal{W}$ space are then used in the proposed style-based decoder to reconstruct the input image.
  • Figure 2: The proposed style-based decoder architecture consists of a low-resolution NeRF that produces the 3D-aware feature map (left) and a set of render blocks (right) that produce the high-resolution RGB face image. For convenience, only the last "To RGB" layer is shown in this figure. Modulated layers are used in both the NeRF and the render blocks to accept style codes. The identity and expression codes are concatenated and fed to the NeRF's modulated layers. The texture and lighting codes are concatenated and used in the render blocks to control the appearance of the generated face.
  • Figure 3: Single-image face reconstruction results. The leftmost column shows the target images. We compare with MoFaNeRF [zhuang2022mofanerf], 3DMM-RF [galanakis20233dmm], and HeadNeRF [hong2022headnerf]. "HeadNeRF (initial code)" shows HeadNeRF's reconstruction from the fitted 3DMM code. "Ours (initial code)" shows our reconstruction from the style code estimated by our face encoder. The right side shows our reconstructed face rendered from different camera views.
  • Figure 4: Reconstructed faces viewed from different angles.
  • Figure 5: Disentangled style mixing in the latent space. The leftmost column shows the source images and the first row shows the target images. We divide the target images into three groups: expression, texture, and lighting. The target image provides the code for the corresponding semantic group, and the source image provides the codes for the remaining semantic groups.
  • ...and 5 more figures
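
The code routing in Figure 2 and the style mixing in Figure 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the `FaceCodes` container, the `mix_codes` and `decoder_inputs` helpers, and the toy code lengths are all hypothetical, the mapping networks and the decoder itself are omitted, and the captions above do not say whether mixing is applied in $\mathcal{Z}$ or $\mathcal{W}$ space. The sketch only shows how one semantic group is swapped between two faces and how the four codes are then grouped into a shape-related and an appearance-related input.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Code = Tuple[float, ...]

# Hypothetical container for the four semantic codes; sizes are illustrative.
@dataclass(frozen=True)
class FaceCodes:
    identity: Code
    expression: Code
    texture: Code
    lighting: Code

GROUPS = ("identity", "expression", "texture", "lighting")

def mix_codes(source: FaceCodes, target: FaceCodes, group: str) -> FaceCodes:
    """Take one semantic group from `target` and the rest from `source`."""
    if group not in GROUPS:
        raise ValueError(f"unknown code group: {group!r}")
    return replace(source, **{group: getattr(target, group)})

def decoder_inputs(codes: FaceCodes) -> Tuple[Code, Code]:
    """Group the codes as described for Figure 2 (assumed plain concatenation)."""
    shape_style = codes.identity + codes.expression    # for the NeRF's modulated layers
    appearance_style = codes.texture + codes.lighting  # for the render blocks
    return shape_style, appearance_style

# Toy values standing in for codes estimated by an encoder.
source = FaceCodes((0.1, 0.2), (0.3,), (0.4, 0.5), (0.6,))
target = FaceCodes((1.1, 1.2), (1.3,), (1.4, 1.5), (1.6,))

# Expression comes from the target; identity, texture, and lighting from the source.
mixed = mix_codes(source, target, "expression")
shape_style, appearance_style = decoder_inputs(mixed)
```

Because each group is a separate field, swapping the expression changes only the shape-related input, while the appearance-related input stays that of the source face. This is the kind of separation the style-mixing figure is meant to demonstrate.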