Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

Amandeep Kumar; Muhammad Awais; Sanath Narayan; Hisham Cholakkal; Salman Khan; Rao Muhammad Anwer

TL;DR

This work tackles scalable, text-driven 3D-aware facial editing without training attribute classifiers for each new attribute. It introduces the Latent Attribute Editor (LAE), which uses learnable style tokens and style mappers to translate text prompts into editing directions in the latent space of a frozen 3D GAN (GMPI-based), guided by a CLIP-based directional loss. The LAE is trained with multiple losses for 3D-aware identity and pose preservation, including $L_{dclip}$, $L_{sc}$, $L_{id}$, $L_{idvc}$, $L_{latent}$, and $L_{\text{alpha}}$, enabling high-quality, view-consistent edits across target poses and novel attributes. The approach is plug-and-play across several 3D generators and achieves fast training (4–8 minutes) with competitive or superior edit quality and identity preservation, greatly expanding the range of editable attributes in 3D facial synthesis.

Abstract

Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require attribute-specific classifiers, learning separate model weights for each attribute, and are not scalable for novel attributes. In this work, we propose an efficient, plug-and-play, 3D-aware face editing framework based on attribute-specific prompt learning, enabling the generation of facial images with controllable attributes across various target poses. To this end, we introduce a text-driven learnable style token-based latent attribute editor (LAE). The LAE harnesses a pre-trained vision-language model to find text-guided attribute-specific editing direction in the latent space of any pre-trained 3D-aware GAN. It utilizes learnable style tokens and style mappers to learn and transform this editing direction to 3D latent space. To train LAE with multiple attributes, we use directional contrastive loss and style token loss. Furthermore, to ensure view consistency and identity preservation across different poses and attributes, we employ several 3D-aware identity and pose preservation losses. Our experiments show that our proposed framework generates high-quality images with 3D awareness and view consistency while maintaining attribute-specific features. We demonstrate the effectiveness of our method on different facial attributes, including hair color and style, expression, and others.
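
To make the idea of a text-guided editing direction concrete, the following is a minimal sketch of the commonly used CLIP directional loss, $1 - \cos(\Delta I, \Delta T)$, where $\Delta I$ is the change in image embedding and $\Delta T$ is the change in text embedding. It is an illustration under stated assumptions, not the paper's exact $L_{dclip}$: the function names are hypothetical, and the toy 3-dimensional vectors stand in for real CLIP embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def directional_loss(img_src, img_edit, txt_src, txt_tgt):
    """Hypothetical helper: 1 - cos(delta_image, delta_text).

    All four arguments are assumed to be embeddings in one shared space
    (for example, the outputs of a CLIP image encoder and text encoder).
    """
    delta_img = [e - s for s, e in zip(img_src, img_edit)]
    delta_txt = [t - s for s, t in zip(txt_src, txt_tgt)]
    return 1.0 - cosine_similarity(delta_img, delta_txt)


# Toy embeddings (assumed values, for illustration only).
img_src = [1.0, 0.0, 0.0]
txt_src = [0.0, 0.0, 1.0]
txt_tgt = [0.0, 2.0, 1.0]

# The image moved in the same direction as the text: the loss is at its minimum.
loss_aligned = directional_loss(img_src, [1.0, 1.0, 0.0], txt_src, txt_tgt)

# The image moved in the opposite direction: the loss is at its maximum.
loss_opposed = directional_loss(img_src, [1.0, -1.0, 0.0], txt_src, txt_tgt)
```

Because only the direction of each change enters the cosine, the loss ignores how large the edit is and penalizes only edits that move the image embedding away from the direction given by the text.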

Paper Structure (16 sections, 11 equations, 7 figures, 3 tables)

Introduction
Related Works
Method
Problem Formulation
Overview of Proposed Pipeline
Latent Attribute Editor (LAE) with Text Driven Editing
3D-aware Identity and Pose Preservation
Efficiency
Experiment
Quantitative Results
Qualitative Results
Analysis
Plug-and-play
Robustness and biasness
Ablation Study
...and 1 more section

Figures (7)

Figure 1: Overall architecture of the proposed method, which is based on attribute-specific prompt learning. The framework comprises a learnable prompt-based latent attribute editor (LAE), a mapping network $f_{map}$, and an RGB$\alpha$ generator $f_G$ with a differentiable renderer $R$. The LAE consists of learnable style tokens, a CLIP-based text encoder $f_T$, and style mappers ($M$). The input to the text encoder for the $i$-th attribute consists of a textual prompt $A^i$, a textual instruction $t$, and a learnable token $V_i$. The text encoder converts this input to $\Delta v$, which is then fed to the style mappers ($M$). The style mappers map $\Delta v$ to the latent space of StyleGAN, which produces RGB images along with alpha maps. These alpha maps, along with the target pose $p_t$, are then fed to the renderer, which generates an image at the given pose. The LAE thus enables facial image generation with controllable attributes at different target poses within a single framework.
Figure 2: Qualitative comparison of face editing between our method, GMPI (zhao2022generative), and the state-of-the-art PREIM3D (li2023preim3d) across various camera angles and attributes. The attributes used for comparison are young for age, blond for hair color, and happy for emotion. To showcase our method's ability to edit novel attributes, the final column presents results obtained from custom prompts. Our method accurately maintains camera poses relative to GMPI and also demonstrates superior identity preservation and editing capability compared with PREIM3D.
Figure 3:
Figure 4:
Figure 6: Robustness of our model to text corruptions. We introduce standard text corruptions in the text prompt to edit for the orange hair attribute. Our model is robust overall, except when the perturbation alters the keywords of the prompt.
...and 2 more figures
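
The Figure 1 caption describes a generator that outputs RGB images with alpha maps, which a renderer then combines into a single image. As a rough illustration of how layered RGB$\alpha$ output can be turned into one pixel, the sketch below applies standard front-to-back alpha compositing. It is an assumption-laden simplification: the function name is hypothetical, it handles a single pixel, and it omits the pose-dependent warping that a renderer conditioned on $p_t$ would apply to each plane first.

```python
def composite_front_to_back(colors, alphas):
    """Composite one pixel across planes ordered nearest-first.

    colors: list of (r, g, b) tuples, one per plane.
    alphas: list of opacities in [0, 1], one per plane.
    Returns the composited color and the remaining transmittance.
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by nearer planes
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha
        out = [o + weight * c for o, c in zip(out, color)]
        transmittance *= 1.0 - alpha
    return tuple(out), transmittance


# A half-transparent red plane in front of an opaque green plane.
pixel, remaining = composite_front_to_back(
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    alphas=[0.5, 1.0],
)
```

Each plane contributes its color weighted by its own opacity and by how much light the nearer planes let through, so a fully opaque front plane hides everything behind it.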
