3D Cartoon Face Generation with Controllable Expressions from a Single GAN Image
Hao Wang, Wenhao Shen, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
TL;DR
This work tackles generating 3D cartoon avatars from a single GAN-generated 2D face without 3D supervision by learning and exploiting the semantic structure of StyleGAN's latent space. It finetunes an FFHQ-based StyleGAN on cartoon data, links the latent spaces so that identical latent codes produce consistent 2D outputs across domains, and uses SeFa-derived directions refined by identity and texture losses to edit expressions while preserving identity. A neural renderer with a VLDA module reconstructs 3D shapes from multiple lighting and viewpoint variations derived from manipulated latent codes, reinforced by symmetry and identity constraints. The approach is evaluated on the Disney, MetFaces, and Ukiyo-e datasets, showing improved 2D cartoon quality (FID, perceptual loss) and 3D reconstruction accuracy (SIDE, MAD) over baselines, with ablations confirming the utility of latent-code optimization and the proposed losses. Overall, the method enables controllable 3D cartoon avatars from single 2D GAN images, with practical implications for animation, gaming, and avatar creation.
Abstract
In this paper, we investigate the open research task of generating 3D cartoon face shapes from a single 2D GAN-generated human face without 3D supervision, while also allowing the facial expressions of the 3D shapes to be manipulated. To this end, we discover the semantic meanings of the StyleGAN latent space, such that we are able to produce face images of various expressions, poses, and lighting conditions by controlling the latent codes. Specifically, we first finetune the pretrained StyleGAN face model on the cartoon datasets. By feeding the same latent codes to the face and cartoon generation models, we aim to realize the translation from 2D human face images to cartoon-styled avatars. We then discover semantic directions of the GAN latent space in order to change the facial expressions while preserving the original identity. As we do not have any 3D annotations for cartoon faces, we manipulate the latent codes to generate images with different poses and lighting conditions, such that we can reconstruct the 3D cartoon face shapes. We validate the efficacy of our method on three cartoon datasets qualitatively and quantitatively.
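The latent-direction editing described above can be illustrated with a small sketch. It follows the general idea behind SeFa-style closed-form factorization: candidate semantic directions are taken as the leading eigenvectors of A^T A, where A is the weight of a generator's first affine layer, and a latent code z is then moved to z + alpha * n. Everything concrete here is an assumption made for illustration: the toy 4x3 `weight` matrix stands in for a real StyleGAN layer, power iteration is swapped in for a full eigendecomposition, and the helper names `top_direction` and `edit_latent` are hypothetical. The identity and texture losses used to refine the directions are not modelled.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_direction(weight, iters=500, seed=0):
    """Approximate the leading eigenvector of weight^T weight.

    Power iteration is used here as a simple stand-in for a full
    eigendecomposition; it only recovers the single strongest direction.
    """
    rng = random.Random(seed)
    weight_t = transpose(weight)
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(len(weight[0]))])
    for _ in range(iters):
        # One step of v <- (A^T A) v, followed by renormalization.
        v = normalize(matvec(weight_t, matvec(weight, v)))
    return v


def edit_latent(z, direction, alpha):
    """Move latent code z along a unit direction by step size alpha."""
    return [zi + alpha * di for zi, di in zip(z, direction)]


# Toy stand-in for the first affine layer (assumption, not a real model).
weight = [
    [3.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]

direction = top_direction(weight)
z = [0.1, -0.2, 0.3]
z_edit = edit_latent(z, direction, alpha=2.0)
```

In the setting described in the abstract, the same edited code would then be passed to both the face generator and the finetuned cartoon generator, so that the expression change appears consistently in both domains. The sign of the recovered direction is arbitrary, so in practice both positive and negative values of `alpha` would be tried.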
