Real-time 3D-aware Portrait Editing from a Single Image

Qingyan Bai, Zifan Shi, Yinghao Xu, Hao Ouyang, Qiuyu Wang, Ceyuan Yang, Xuan Wang, Gordon Wetzstein, Yujun Shen, Qifeng Chen

TL;DR

3DPE performs real-time, 3D-consistent portrait editing from a single image by distilling editing priors from a diffusion-based image editor and a 3D GAN into a lightweight module built on Live3D/EG3D. Using cross-attention-based prompt conditioning and a dual-branch feature strategy, it preserves geometry while enabling appearance edits from both image and text prompts, guided by 2D pseudo-labels and 3D supervision. The approach runs in real time (~40 ms per image), supports fast customization (~5 minutes) for user-defined prompts, and achieves novel-view consistency that outperforms both baselines that edit in 2D first and heavy optimization-based baselines, with practical applications in AR/VR, teleconferencing, and video editing. It also provides an interactive editing system and efficient adaptation pipelines, though the authors acknowledge limitations in fine-grained novel-view details and occasional video flicker.

Abstract

This work presents 3DPE, a practical method that efficiently edits a face image according to given prompts, such as reference images or text descriptions, in a 3D-aware manner. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and strong editing capability, respectively. This design brings two compelling advantages over existing approaches. First, our method achieves real-time editing with a feedforward network (~0.04s per image), over 100x faster than the next-fastest competitor. Second, thanks to the powerful priors, our module can focus on learning editing-related variations, so it handles various types of editing simultaneously during training and further supports fast adaptation to user-specified editing types at inference time (e.g., with ~5min of fine-tuning per style).

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Photorealistic editing results produced by our proposed 3DPE, which allows users to perform 3D-aware portrait editing using image or text prompts. Compared with baseline methods such as InstructPix2Pix (IP2P) + Live3D and PTI + StyleCLIP (details are given in the experiments section), our approach accurately follows the guidance from reference prompts and is considerably more efficient.
  • Figure 2: Overview of our method. We distill the priors of the diffusion model and the 3D GAN for real-time 3D-aware editing. Our model is fine-tuned from Live3D, where we extract features from the input portrait $\mathrm{I}$ using $\mathrm{E}_{high}(\cdot)$ and $\mathrm{E}_{low}(\cdot)$. The prompt embedding is generated with $\mathrm{E}_{p}(\cdot)$ and fused with the input features from $\mathrm{E}_{high}(\cdot)$ through a cross-attention mechanism. Our model is trained to mimic the output of the diffusion prior to acquire editing knowledge, while geometry constraints are enforced through triplane, multi-view image, and depth supervision from the 3D prior. Here, InstructPix2Pix and Live3D serve as the diffusion and 3D priors, respectively. Note that only $\mathrm{E}_{p}(\cdot)$ and $\mathrm{E}_{t}(\cdot)$ are learnable during training; all other parameters remain frozen.
  • Figure 3: Disentanglement in Live3D features. We separately disable the features from $\mathrm{E}_{high}(\cdot)$ and $\mathrm{E}_{low}(\cdot)$ to infer the reconstructed image. Without $\mathrm{E}_{high}(\cdot)$, the output retains the coarse structure but loses its appearance. Conversely, when $\mathrm{E}_{low}(\cdot)$ is deactivated, the reconstructed portraits preserve the texture (such as the blue and purple reflection on the glasses) but fail to capture the geometry.
  • Figure 4: Qualitative comparisons. We compare the results of several baselines with image prompts and text prompts. In each case, we include the edited portraits as well as their novel-view renderings. Our method generates high-quality edited portraits with better 3D quality and better alignment with the reference prompts.
  • Figure 5: (a) Generated results of customized prompt adaptation and (b) its learning process. We show the intermediate testing results at 10s, 1min, 2min, and 5min during adaptation for the novel style "golden statue".
  • ...and 2 more figures
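
The Figure 2 caption says the prompt embedding is combined with the $\mathrm{E}_{high}(\cdot)$ features through cross-attention. The following is a minimal, dependency-free sketch of single-head cross-attention, not the authors' implementation. It assumes that image feature tokens act as queries and prompt tokens as keys and values, and that the result is added back to the image tokens as a residual; the captions do not state either detail. All names (`cross_attention`, `image_tokens`, `prompt_tokens`, the projection matrices) are hypothetical, and the toy values are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a nested-list matrix."""
    return [list(col) for col in zip(*a)]


def cross_attention(image_tokens, prompt_tokens, w_q, w_k, w_v):
    """Single-head cross-attention.

    Assumption: queries come from image features, keys and values from
    the prompt embedding. Returns the fused tokens and attention weights.
    """
    q = matmul(image_tokens, w_q)
    k = matmul(prompt_tokens, w_k)
    v = matmul(prompt_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]
    attended = matmul(weights, v)
    # Residual connection (an assumption): prompt information is added
    # on top of the original image features rather than replacing them.
    fused = [[x + y for x, y in zip(tok, att)]
             for tok, att in zip(image_tokens, attended)]
    return fused, weights


# Toy example: 3 image tokens and 2 prompt tokens, feature dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt_tokens = [[0.5, -0.5], [2.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections

fused, weights = cross_attention(image_tokens, prompt_tokens,
                                 identity, identity, identity)
```

Each row of `weights` is a distribution over prompt tokens for one image token, so the output keeps one fused feature per image token regardless of how many prompt tokens are supplied.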