3D Cartoon Face Generation with Controllable Expressions from a Single GAN Image

Hao Wang, Wenhao Shen, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao

TL;DR

This work generates 3D cartoon avatars from a single GAN-generated 2D face, without 3D supervision, by exploiting the semantic structure of StyleGAN's latent space. It fine-tunes an FFHQ-pretrained StyleGAN on cartoon data so that the same latent code produces consistent 2D outputs in the human and cartoon domains, and it uses SeFa-derived directions, refined by identity and texture losses, to edit expressions while preserving identity. A neural renderer with a VLDA module (viewpoint, lighting, depth, albedo) reconstructs 3D shapes from images with varied lighting and viewpoints obtained by manipulating latent codes, reinforced by symmetry and identity constraints. The approach is evaluated on the Disney, MetFaces, and Ukiyo-e datasets, where it reports better 2D cartoon quality (FID, perceptual loss) and 3D reconstruction accuracy (SIDE, MAD) than the baselines; ablations support the utility of latent-code optimization and the proposed losses. The method thus yields controllable 3D cartoon avatars from single 2D GAN images, with applications in animation, gaming, and avatar creation.

Abstract

In this paper, we investigate the open research task of generating 3D cartoon face shapes from a single GAN-generated 2D human face, without 3D supervision, while also allowing the facial expressions of the 3D shapes to be manipulated. To this end, we discover the semantic meanings of the StyleGAN latent space, so that we can produce face images with various expressions, poses, and lighting conditions by controlling the latent codes. Specifically, we first finetune the pretrained StyleGAN face model on the cartoon datasets. By feeding the same latent codes to the face and cartoon generation models, we translate 2D human face images into cartoon-styled avatars. We then discover semantic directions in the GAN latent space that change the facial expression while preserving the original identity. As we do not have any 3D annotations for cartoon faces, we manipulate the latent codes to generate images with different poses and lighting conditions, from which we reconstruct the 3D cartoon face shapes. We validate the efficacy of our method on three cartoon datasets, both qualitatively and quantitatively.
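
The idea of discovering semantic directions and moving a latent code along them can be illustrated with a closed-form factorization in the style of SeFa, which takes the eigenvectors of $A^\top A$ with the largest eigenvalues, where $A$ is the weight of a layer acting on the latent code. The following is a minimal sketch under stated assumptions, not the authors' implementation: the weight matrix is random stand-in data rather than a real StyleGAN layer, and the names `sefa_directions` and `edit_latent` are hypothetical.

```python
import numpy as np


def sefa_directions(affine_weight, num_directions=5):
    """Return the top-k unit-norm directions and their eigenvalues.

    affine_weight: array of shape (out_dim, latent_dim), standing in for the
    weight A of a layer that maps a latent code w to features (A @ w + b).
    """
    # A^T A is symmetric, so eigh applies; it returns ascending eigenvalues.
    gram = affine_weight.T @ affine_weight
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # Keep the directions with the largest eigenvalues, i.e. the ones that
    # change the layer output the most per unit step.
    order = np.argsort(eigenvalues)[::-1][:num_directions]
    return eigenvectors[:, order].T, eigenvalues[order]


def edit_latent(w, direction, alpha):
    """Move a latent code along a direction by step size alpha."""
    return w + alpha * direction


# Stand-in data (assumption): a random 64x16 "layer" and a random latent code.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 16))
directions, strengths = sefa_directions(A, num_directions=3)

w = rng.standard_normal(16)
w_edited = edit_latent(w, directions[0], alpha=2.0)
```

In the pipeline described above, the same edited code would be fed to both the human-face and cartoon generators, and the offset would then be refined so that identity is preserved; that refinement step is not shown here.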

Paper Structure (20 sections, 17 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Overview of our proposed pipeline, in which we train StyleGAN models for human faces and cartoon datasets respectively. Given a single GAN-generated human face image, we first find its corresponding latent codes ${\rm\bf w}$ in the latent space $\mathcal{W}$, then input ${\rm\bf w}$ to the cartoon generator $G$ to generate the cartoon faces. We then uncover the semantic directions of ${\rm\bf w}$, by which we can manipulate the facial expressions, poses, and lighting conditions of the generated images. The manipulated images are fed into a neural renderer for 3D reconstruction.
  • Figure 2: Illustration of the 2D cartoon face generation model training and the latent code manipulation process. We first finetune the pretrained FFHQ StyleGAN model on the cartoon dataset. Notably, we feed the cartoon generator the same latent codes as the human face image generator. We then interpolate the transferred model, such that we can generate photo-realistic cartoon images. In the latent code manipulation phase, we first discover the semantic directions in the latent space $\mathcal{W}$ of the trained StyleGAN model. We then optimize the offset $\Delta {\rm\bf w}$ to the original latent code ${\rm\bf w}$ with the identity loss $\mathcal{L}_{id}$ and the low-level feature regularization loss $\mathcal{L}_{low}$.
  • Figure 3: Illustration of the 3D cartoon shape reconstruction process. VLDA stands for viewpoint, lighting conditions, depth, and albedo. Given an input image, we feed it into the renderer with the initial shape prior. We randomly sample lighting conditions and viewpoints, generating rendered images under various viewpoints and lighting. These rendered results from the 3D shapes are then projected back to the latent space $\mathcal{W}$ of StyleGAN. The projection improves the quality of these images, which are used to refine the initial shapes. The reconstruction loss is applied between the input and reconstructed images.
  • Figure 4: Visualization of the manipulated facial expressions. In each block, from left to right, we show consistent results for natural human face images, Disney-style, MetFaces-style, and Ukiyo-e-style images.
  • Figure 5: Qualitative comparison between our method and two related works: (a) Unsup3D (Wu et al., 2020) and (b) LiftedGAN (Shi et al., 2021).
  • ...and 4 more figures
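
The rendering step described in Figure 3, where a shape is re-rendered under randomly sampled lighting, can be sketched with a simple Lambertian shading model over a depth map and an albedo map. This is an illustrative assumption rather than the renderer used in the paper: the shading model, the ambient and diffuse weights, the synthetic depth bump, and the names `normals_from_depth`, `shade`, and `sample_light` are all hypothetical, and viewpoint sampling is omitted.

```python
import numpy as np


def normals_from_depth(depth):
    """Estimate unit surface normals from a depth map by finite differences.

    Assumption: the map is treated as a height field with +z toward the camera.
    """
    dz_dy, dz_dx = np.gradient(depth)  # gradients along rows (y) and columns (x)
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)


def shade(albedo, depth, light_dir, ambient=0.3, diffuse=0.7):
    """Render an (H, W, 3) image as albedo times Lambertian shading."""
    light_dir = light_dir / np.linalg.norm(light_dir)
    # Clamp so surfaces facing away from the light receive only ambient light.
    n_dot_l = np.clip(normals_from_depth(depth) @ light_dir, 0.0, None)
    shading = ambient + diffuse * n_dot_l
    return albedo * shading[..., None]


def sample_light(rng, max_tilt=0.5):
    """Sample a light direction tilted randomly away from the camera axis."""
    x, y = rng.uniform(-max_tilt, max_tilt, size=2)
    return np.array([x, y, 1.0])


# Stand-in data (assumption): a Gaussian bump as depth and a constant albedo.
rng = np.random.default_rng(0)
H = W = 32
yy, xx = np.mgrid[-1:1:H * 1j, -1:1:W * 1j]
depth = np.exp(-(xx ** 2 + yy ** 2) / 0.3)
albedo = np.full((H, W, 3), 0.8)

# Several renderings of the same shape under different sampled lights.
relit_views = [shade(albedo, depth, sample_light(rng)) for _ in range(4)]
```

In the process the caption describes, renderings like these would be projected back into the StyleGAN latent space $\mathcal{W}$ to obtain higher-quality images that supervise the refinement of depth and albedo.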