Revealing Directions for Text-guided 3D Face Editing

Zhuo Chen, Yichao Yan, Sehngqi Liu, Yuhao Cheng, Weiming Zhao, Lincheng Li, Mengxiao Bi, Xiaokang Yang

TL;DR

Face Clan is a fast and text-general approach for generating and manipulating 3D faces from arbitrary attribute descriptions. It offers intuitive, broadly applicable text-guided face editing for multimedia content creation.

Abstract

3D face editing is a significant task in multimedia, aimed at manipulating 3D face models under various control signals. The success of 3D-aware GANs provides expressive 3D models learned from single-view 2D images only, encouraging researchers to discover semantic editing directions in their latent space. However, previous methods struggle to balance quality, efficiency, and generalization. To address this, we explore bringing the strengths of diffusion models into 3D-aware GANs. In this paper, we present Face Clan, a fast and text-general approach for generating and manipulating 3D faces based on arbitrary attribute descriptions. To achieve disentangled editing, we propose to diffuse on the latent space under a pair of opposite prompts to estimate a mask indicating the region of interest on the latent codes. Based on the mask, we then apply denoising to the masked latent codes to reveal the editing direction. Our method offers precisely controllable manipulation, allowing users to intuitively customize regions of interest with a text description. Experiments demonstrate the effectiveness and generalization of Face Clan across various pre-trained GANs. It offers an intuitive and widely applicable approach to text-guided face editing that contributes to the landscape of multimedia content creation.
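
The two-stage idea in the abstract (estimate a mask from a pair of opposite prompts, then denoise only the masked latent entries) can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `estimate_mask` and `masked_update`, the `keep_ratio` parameter, the top-k thresholding rule, and the one-step "denoiser" that simply subtracts the predicted noise. The hand-written arrays stand in for the noise predictions a real diffusion model would produce on the latent code.

```python
import numpy as np

def estimate_mask(eps_pos, eps_neg, keep_ratio=0.25):
    """Hypothetical mask rule: keep the latent entries where the noise
    predictions under the two opposite prompts disagree the most."""
    diff = np.abs(eps_pos - eps_neg)
    k = max(1, int(round(keep_ratio * diff.size)))
    threshold = np.sort(diff.ravel())[-k]  # k-th largest disagreement
    # Ties at the threshold may select slightly more than k entries.
    return (diff >= threshold).astype(diff.dtype)

def masked_update(w, w_denoised, mask):
    """Take the denoised value inside the mask, keep the original elsewhere."""
    return mask * w_denoised + (1.0 - mask) * w

# Toy latent code and stand-in noise predictions (not real model outputs).
w = np.ones(8)
eps_pos = np.array([0.1, 0.0, 5.0, 0.2, 0.0, 0.0, -4.0, 0.1])  # e.g. "with beard"
eps_neg = np.zeros(8)                                          # e.g. "without beard"

mask = estimate_mask(eps_pos, eps_neg, keep_ratio=0.25)
w_denoised = w - eps_pos              # assumed single toy denoising step
w_edited = masked_update(w, w_denoised, mask)
direction = w_edited - w              # nonzero only inside the mask
```

In this sketch the small disagreements outside the mask are discarded, so the resulting `direction` changes only the two selected latent entries, which is the sense in which masking keeps an edit disentangled.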

Paper Structure (17 sections, 22 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Overview of our proposed method. (a) The architecture of our base 3D-aware generator, EG3D. (b) The inference pipeline that redistributes the latent code for text-guided synthesis. (c) Illustration of linking text to the region of interest on the latent codes. (d) Denoising applied to the masked latent codes for disentangled face editing.
  • Figure 2: Illustration of the motivation: a diffusion model is adopted as both a controllable data generator and an attribute classifier.
  • Figure 3: Qualitative comparisons with text-guided face editing methods. (a), (b), (c), and (d) are the cases with empirically different difficulties, i.e., easy for beard, normal for sunglasses, hard for hair, and extreme for cap. Our method achieves better results in most cases compared to the other three methods, with higher text consistency and stronger identity preservation.
  • Figure 4: Qualitative comparisons with additional 2D methods. Our method can achieve natural results for both geometry and appearance editing while maintaining good identity consistency.
  • Figure 5: Qualitative comparisons with FaceEdit3D (Cheng et al., CVPR 2024) on geometry manipulation.
  • ...and 6 more figures