GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

Keqiang Sun, Amin Jourabloo, Riddhish Bhalodia, Moustafa Meshry, Yu Rong, Zhengyu Yang, Thu Nguyen-Phuoc, Christian Haene, Jiu Xu, Sam Johnson, Hongsheng Li, Sofien Bouaziz

TL;DR

GenCA introduces a two-stage, text-conditioned framework for photo-realistic, editable, and drivable 3D avatars. First, the Codec Avatar Auto-Encoder (CAAE) learns latent spaces for geometry and texture and relies on a Universal Prior Model for expression driving. Second, the Identity Generation Model runs latent diffusion in two steps, a Geometry Generation Module (GM) followed by a Geometry Conditioned Texture Generation Module (GCTM), to produce coherent, editable identities from natural-language prompts. This supports sampling new identities, single-shot avatar reconstruction, and downstream editing with high fidelity, including regions such as the eyes and mouth interior. The authors report better visual quality, drivability, and editability than state-of-the-art methods, which makes the approach relevant to VR/AR, film, and gaming pipelines.
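
The ordering of the two generation steps can be illustrated with a toy sketch. This is not the authors' implementation: `toy_text_embedding` and `make_toy_denoiser` are hypothetical placeholders for a trained text encoder and trained diffusion networks, and the latent size and step count are arbitrary. The sketch only shows the data flow, in which the geometry latent is sampled from text first and the texture latent is then sampled conditioned on both the text and that geometry.

```python
import random
from typing import Callable, List, Tuple

Latent = List[float]
# A denoiser maps (noisy latent, timestep) to a slightly less noisy latent.
Denoiser = Callable[[Latent, int], Latent]


def run_reverse_process(denoise: Denoiser, dim: int, steps: int,
                        rng: random.Random) -> Latent:
    """Start from Gaussian noise and apply the denoiser from t = steps - 1 down to 0."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(steps)):
        z = denoise(z, t)
    return z


def make_toy_denoiser(condition: Latent) -> Denoiser:
    """Hypothetical stand-in for a trained network: moves z halfway toward `condition`."""
    def denoise(z: Latent, t: int) -> Latent:
        return [0.5 * zi + 0.5 * ci for zi, ci in zip(z, condition)]
    return denoise


def toy_text_embedding(prompt: str, dim: int) -> Latent:
    """Hypothetical stand-in for a text encoder: one deterministic vector per prompt."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def generate_identity(prompt: str, dim: int = 8, steps: int = 20,
                      seed: int = 0) -> Tuple[Latent, Latent]:
    rng = random.Random(seed)
    text = toy_text_embedding(prompt, dim)
    # Step 1 (GM role): geometry latent conditioned on the text only.
    z_geo = run_reverse_process(make_toy_denoiser(text), dim, steps, rng)
    # Step 2 (GCTM role): texture latent conditioned on the text AND the sampled geometry.
    tex_condition = [0.5 * (a + b) for a, b in zip(text, z_geo)]
    z_tex = run_reverse_process(make_toy_denoiser(tex_condition), dim, steps, rng)
    return z_geo, z_tex


z_geo, z_tex = generate_identity("a young woman with green hair")
```

In the real system the two latents would be decoded by the CAAE decoder into an avatar; that decoding step is omitted here because its interface is not described in this summary.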

Abstract

Photo-realistic and controllable 3D avatars are crucial for various applications such as virtual and mixed reality (VR/MR), telepresence, gaming, and film production. Traditional methods for avatar creation often involve time-consuming scanning and reconstruction processes for each avatar, which limits their scalability. Furthermore, these methods do not offer the flexibility to sample new identities or modify existing ones. On the other hand, by learning a strong prior from data, generative models provide a promising alternative to traditional reconstruction methods, easing the time constraints for both data capture and processing. Additionally, generative methods enable downstream applications beyond reconstruction, such as editing and stylization. Nonetheless, the research on generative 3D avatars is still in its infancy, and therefore current methods still have limitations such as creating static avatars, lacking photo-realism, having incomplete facial details, or having limited drivability. To address this, we propose a text-conditioned generative model that can generate photo-realistic facial avatars of diverse identities, with more complete details like hair, eyes and mouth interior, and which can be driven through a powerful non-parametric latent expression space. Specifically, we integrate the generative and editing capabilities of latent diffusion models with a strong prior model for avatar expression driving. Our model can generate and control high-fidelity avatars, even those out-of-distribution. We also highlight its potential for downstream applications, including avatar editing and single-shot avatar reconstruction.

Paper Structure (37 sections, 14 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Generative Codec Avatars. Given a sentence describing the attributes of a face, our method generates a Codec Avatar, which can be driven by realistic expressions (top). GenCA has many downstream applications such as avatar reconstruction from a single in-the-wild image (bottom). Additionally, it allows for editing features, such as changing hair color to green (top) or removing facial hair (bottom).
  • Figure 2: Main CAAE Framework for learning the latent space for geometry and texture of avatars.
  • Figure 3: Training pipeline of the Identity Generation Model. Geometry Generation Module (GM): generates $z_{geo}$ for realistic geometries from text descriptions. Geometry Conditioned Texture Generation Module (GCTM): generates $z_{tex}$ for high-quality textures that are consistent with the conditioning geometry, also from text descriptions.
  • Figure 4: Smooth linear interpolation among the geometry and texture latent codes.
  • Figure 5: Generation Results: Qualitative results generated from the captions provided in leftmost column.
  • ...and 7 more figures
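
The linear interpolation mentioned in the Figure 4 caption can be written out as a minimal sketch. The function names and the 4-dimensional toy codes below are assumptions for illustration; the actual latent dimensions and decoder are not given in this summary. Each intermediate code is $(1 - t)\,z_a + t\,z_b$ for $t \in [0, 1]$, and the same routine would apply to either $z_{geo}$ or $z_{tex}$.

```python
from typing import List, Sequence


def lerp(z_a: Sequence[float], z_b: Sequence[float], t: float) -> List[float]:
    """Linearly interpolate between two latent codes: (1 - t) * z_a + t * z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent codes must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a: Sequence[float], z_b: Sequence[float],
                       steps: int) -> List[List[float]]:
    """Return `steps` evenly spaced codes from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Toy 4-dimensional codes standing in for the geometry latents of two identities.
z_geo_a = [0.0, 1.0, -2.0, 4.0]
z_geo_b = [2.0, 1.0, 0.0, 0.0]
path = interpolation_path(z_geo_a, z_geo_b, steps=5)
```

Decoding each code along `path` would give the gradual identity transition that the figure describes.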