ArtNeRF: A Stylized Neural Field for 3D-Aware Cartoonized Face Synthesis

Zichen Tang, Hongyu Yang

TL;DR

ArtNeRF tackles 3D-aware arbitrary-style face synthesis by extending NeRF-based generation with a style encoder and a Self-Adaptive Style Blending Module. It combines a conditional radiance field with dense skip connections, a lightweight neural rendering module, and a triple-branch discriminator to ensure both style fidelity and multi-view consistency, enabling high-quality, real-time rendering of stylized faces across views. The method leverages a self-supervised style encoder via contrastive learning and a two-stage training strategy to stabilize cross-domain adaptation, achieving better FID, KID, IS, and LPIPS scores than several 2D-guided and early 3D-aware baselines. Overall, ArtNeRF advances 3D-aware cartoonized face synthesis with arbitrary styles and practical rendering speed, opening pathways for AR/VR and content-generation applications, while acknowledging limits on extreme viewpoints and proposing future directions with more expressive 3D representations.

Abstract

Recent advances in generative visual models and neural radiance fields have greatly boosted 3D-aware image synthesis and stylization tasks. However, previous NeRF-based work is limited to single-scene stylization, and training a model to generate 3D-aware cartoon faces with arbitrary styles remains unsolved. We propose ArtNeRF, a novel face stylization framework derived from a 3D-aware GAN, to tackle this problem. In this framework, we utilize an expressive generator to synthesize stylized faces and a triple-branch discriminator module to improve the visual quality and style consistency of the generated faces. Specifically, a style encoder based on contrastive learning is leveraged to extract robust low-dimensional embeddings of style images, empowering the generator with the knowledge of various styles. To smooth the training process of cross-domain transfer learning, we propose an adaptive style blending module which helps inject style information and allows users to freely tune the level of stylization. We further introduce a neural rendering module to achieve efficient real-time rendering of higher-resolution images. Extensive experiments demonstrate that ArtNeRF is versatile in generating high-quality 3D-aware cartoon faces with arbitrary styles.

Paper Structure

This paper contains 15 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Multi-view 3D-aware faces with arbitrary styles generated by our model. We evenly select 5 views within a reasonable range where $\mathrm{pitch}\in[\frac{\pi}{2}-0.2, \frac{\pi}{2}+0.2]$ and $\mathrm{yaw}\in[\frac{\pi}{2}-0.4, \frac{\pi}{2}+0.4]$.
  • Figure 2: The pipeline of the generator in ArtNeRF. Given an identity code $\boldsymbol{z_f}$ sampled from a normal distribution and a style image $\boldsymbol{X_s}$, we first extract the style code $\boldsymbol{z_s}$ using the style encoder $E_s$. Subsequently, dual mapping networks map $\boldsymbol{z_f}, \boldsymbol{z_s}$ to $\boldsymbol{W_f}, \boldsymbol{W_s}$ in the $W^+$ space. The self-adaptive SBM then blends $\boldsymbol{W_f}, \boldsymbol{W_s}$ based on a split index $i$ and injects the style information into the 3D generator. Given camera poses, the dense skip connections and the neural rendering module enable real-time rendering of 3D-aware stylized faces.
  • Figure 3: The architecture of the style encoder $E_s$.
  • Figure 4: The architecture of the neural rendering module.
  • Figure 5: Qualitative comparison of style-guided face synthesis between AdaIN, MUNIT, FUNIT, DRIT++, StarGANv2, BlendGAN, and ours. Note that our model not only generates plausible stylized faces but also produces 3D-aware results with high multi-view consistency.
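
The view ranges in the Figure 1 caption and the split-index blending in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `sample_views` and `blend_style_codes`, the `strength` parameter, and the choice that layers before the split index keep the identity code while layers from the split index onward move toward the style code are all assumptions made for illustration; the captions do not specify how pitch and yaw are paired, nor how the self-adaptive module computes its blend.

```python
import numpy as np


def sample_views(n_views=5, pitch_half_range=0.2, yaw_half_range=0.4):
    """Evenly spaced pitch and yaw angles (radians) centered on pi/2.

    The default ranges follow the Figure 1 caption. How the two angles are
    paired into camera poses is not stated there, so they are returned as
    two separate arrays.
    """
    center = np.pi / 2
    pitch = np.linspace(center - pitch_half_range, center + pitch_half_range, n_views)
    yaw = np.linspace(center - yaw_half_range, center + yaw_half_range, n_views)
    return pitch, yaw


def blend_style_codes(w_f, w_s, split_index, strength=1.0):
    """Blend per-layer W+ codes of shape (num_layers, dim).

    Assumption (hypothetical, not from the paper): layers before
    `split_index` keep the identity code w_f, and layers from `split_index`
    onward are linearly interpolated toward the style code w_s, with
    `strength` in [0, 1] standing in for the user-tunable stylization level.
    """
    w_f = np.asarray(w_f, dtype=np.float64)
    w_s = np.asarray(w_s, dtype=np.float64)
    if w_f.shape != w_s.shape or w_f.ndim != 2:
        raise ValueError("w_f and w_s must share the shape (num_layers, dim)")
    if not 0 <= split_index <= w_f.shape[0]:
        raise ValueError("split_index must lie in [0, num_layers]")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")

    blended = w_f.copy()
    blended[split_index:] = (
        (1.0 - strength) * w_f[split_index:] + strength * w_s[split_index:]
    )
    return blended


if __name__ == "__main__":
    pitch, yaw = sample_views()
    print("pitch:", np.round(pitch, 3))
    print("yaw:  ", np.round(yaw, 3))

    rng = np.random.default_rng(0)
    w_f = rng.normal(size=(8, 4))  # toy identity codes: 8 layers, 4 dims
    w_s = rng.normal(size=(8, 4))  # toy style codes
    w = blend_style_codes(w_f, w_s, split_index=3, strength=0.5)
    print("layers unchanged before split:", np.allclose(w[:3], w_f[:3]))
```

In this sketch, setting `strength` to 0 reproduces the identity codes and setting it to 1 replaces every layer from the split index onward with the style codes, which mirrors the tunable stylization level described in the abstract only in spirit.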