GenEAva: Generating Cartoon Avatars with Fine-Grained Facial Expressions from Realistic Diffusion-based Faces

Hao Yu, Rupayan Mallick, Margrit Betke, Sarah Adel Bargal

TL;DR

GenEAva introduces a diffusion-based pipeline to generate expressive cartoon avatars by fine-tuning SDXL on 135 fine-grained expressions from Emo135, followed by a cartoon-style transfer. A dedicated GenEAva 1.0 dataset (13,230 avatars, 135 expressions) supports diverse and private representations with balanced demographics. The approach achieves superior expression fidelity compared to SDXL on multiple metrics and demonstrates no memorization of training identities through quantitative analyses and user studies, with stylization preserving identity and expression in most cases. This work provides a privacy-conscious, expressive benchmark for cartoon avatar generation and suggests avenues for improved expression control and real-time deployment.

Abstract

Cartoon avatars have been widely used in various applications, including social media, online tutoring, and gaming. However, existing cartoon avatar datasets and generation methods struggle to present highly expressive avatars with fine-grained facial expressions and are often inspired by real-world identities, raising privacy concerns. To address these challenges, we propose a novel framework, GenEAva, for generating high-quality cartoon avatars with fine-grained facial expressions. Our approach fine-tunes a state-of-the-art text-to-image diffusion model to synthesize highly detailed and expressive facial expressions. We then incorporate a stylization model that transforms these realistic faces into cartoon avatars while preserving both identity and expression. Leveraging this framework, we introduce the first expressive cartoon avatar dataset, GenEAva 1.0, specifically designed to capture 135 fine-grained facial expressions, featuring 13,230 expressive cartoon avatars with a balanced distribution across genders, racial groups, and age ranges. We demonstrate that our fine-tuned model generates more expressive faces than the state-of-the-art text-to-image diffusion model SDXL. We also verify that the cartoon avatars generated by our framework do not include memorized identities from fine-tuning data. The proposed framework and dataset provide a diverse and expressive benchmark for future research in cartoon avatar generation.

Paper Structure

This paper contains 23 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: The proposed pipeline, GenEAva, for generating expressive cartoon avatars. During the fine-tuning phase, we train a text-to-image diffusion model on facial expression images. The model is optimized with a combination of a diffusion model loss (DM loss) and an expression loss computed by an expression encoder. In the inference phase, we generate facial expression images by prompting the model, then apply a stylization model to transform them into cartoon avatars.
  • Figure 2: Examples of realistic and stylized images across a variety of facial expressions in GenEAva 1.0. The images illustrate diverse age groups and a balanced representation of race and gender. The stylization effectively preserves the identity and expressions of the realistic images.
  • Figure 3: Interface for the Amazon Mechanical Turk (AMT) user study. The first question addresses whether the stylization module preserves identity, and the second addresses whether it preserves the facial expression. Each evaluator was presented with 15 such examples, one of which was a test question showing two images with obviously different identities; this test question was used to check the validity of the HIT. Nine Turkers were recruited to complete each HIT. Invalid HITs were discarded.
  • Figure 4: User study results evaluating the stylization on identity preservation and expression preservation. We achieved a 96% approval rating for preserving facial expression and a 93% approval rating for preserving identity, indicating the effectiveness of the stylization method. The approval rating is the percentage (%) of image pairs judged to preserve the facial expression or the identity, respectively.
  • Figure 5: Qualitative examples of images generated by ChatGPT [openai2024chatgpt] and our proposed GenEAva. GenEAva shows a superior ability to capture subtle expressions compared to ChatGPT, which produces either generic neutral or exaggerated expressions.
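
The Figure 1 caption describes a fine-tuning objective that combines a diffusion model loss with an expression loss from an expression encoder. The captions do not give the exact form, so the following is only a minimal sketch under stated assumptions: the DM loss is taken to be the mean squared error between predicted and true noise, the expression loss is taken to be a cosine distance between expression-encoder embeddings, and the two are summed with a hypothetical weight `expr_weight`. The function names and the weight value are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error; stands in for the DM loss (assumed form)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; stands in for the expression loss (assumed form)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def combined_loss(
    noise_pred: Sequence[float],
    noise_true: Sequence[float],
    expr_emb_generated: Sequence[float],
    expr_emb_reference: Sequence[float],
    expr_weight: float = 0.1,  # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of the DM loss and the expression loss (assumed combination)."""
    dm_loss = mse(noise_pred, noise_true)
    expr_loss = cosine_distance(expr_emb_generated, expr_emb_reference)
    return dm_loss + expr_weight * expr_loss
```

In an actual training loop the inputs would be tensors produced by the diffusion model and the expression encoder; plain Python lists are used here only to keep the sketch self-contained.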