InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image

Jianhui Li; Shilong Liu; Zidong Liu; Yikai Wang; Kaiwen Zheng; Jinghui Xu; Jianmin Li; Jun Zhu

TL;DR

InstructPix2NeRF tackles instructed 3D-aware portrait editing from a single image by combining NeRF-based generation with a latent diffusion model operating in the $\mathcal{W}+$ space. It introduces a triplet data mechanism, token position randomization, and an identity consistency module to achieve multi-instruction, identity-preserving editing with 3D consistency, guided by CLIP text embeddings. The approach uses an inversion encoder $E$, a NeRF-based generator $G$, and a diffusion transformer backbone to model $p(w \mid X_o, T)$, the distribution of latent codes $w$ given the original image $X_o$ and the text instruction $T$, which enables end-to-end editing from human instructions. The reported experiments show better identity preservation and 3D fidelity than the baselines, both quantitatively and qualitatively, and the method is potentially applicable to interactive VR/metaverse settings.
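The data flow summarized above (invert with $E$, sample a latent conditioned on the original latent and the instruction, render with $G$) can be sketched roughly as follows. This is only an illustrative toy, not the authors' implementation: `edit_portrait`, `toy_encoder`, `toy_denoiser`, and `toy_generator` are hypothetical names, latents are plain lists of floats, and the update rule is a placeholder rather than a real diffusion sampler.

```python
import random
from typing import Callable, List

Latent = List[float]


def edit_portrait(
    image: List[float],
    instruction: str,
    encoder: Callable[[List[float]], Latent],
    denoiser: Callable[[Latent, int, Latent, str], Latent],
    generator: Callable[[Latent], List[float]],
    steps: int = 4,
    seed: int = 0,
) -> List[float]:
    """Toy data flow: invert the image, sample an edited latent, render it."""
    rng = random.Random(seed)
    w_orig = encoder(image)                      # stands in for E: image -> latent code
    w = [rng.gauss(0.0, 1.0) for _ in w_orig]    # start from Gaussian noise
    for t in reversed(range(steps)):
        # Each step is conditioned on the original latent and the instruction.
        w = denoiser(w, t, w_orig, instruction)
    return generator(w)                          # stands in for G: latent -> rendering


# Hypothetical stand-ins so the sketch runs end to end.
def toy_encoder(image: List[float]) -> Latent:
    return [float(p) for p in image]


def toy_denoiser(w: Latent, t: int, w_orig: Latent, instruction: str) -> Latent:
    # Placeholder update: move halfway toward an instruction-dependent target.
    target = [x + 0.01 * len(instruction) for x in w_orig]
    return [0.5 * a + 0.5 * b for a, b in zip(w, target)]


def toy_generator(w: Latent) -> List[float]:
    return list(w)


result = edit_portrait([0.1, 0.2, 0.3], "add glasses",
                       toy_encoder, toy_denoiser, toy_generator)
```

In the actual system the three callables would be the pretrained inversion encoder, the trained diffusion transformer, and the NeRF-based generator; the sketch only fixes the order in which they are composed.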

Abstract

With the success of Neural Radiance Field (NeRF) in 3D-aware portrait editing, a variety of works have achieved promising results regarding both quality and 3D consistency. However, these methods rely heavily on per-prompt optimization when handling natural language as editing instructions. Due to the lack of labeled human face 3D datasets and effective architectures, the area of human-instructed 3D-aware editing for open-world portraits in an end-to-end manner remains under-explored. To solve this problem, we propose an end-to-end diffusion-based framework termed InstructPix2NeRF, which enables instructed 3D-aware portrait editing from a single open-world image with human instructions. At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data. With the help of our proposed token position randomization strategy, we can even achieve multi-semantic editing in a single pass with the portrait identity well preserved. In addition, we propose an identity consistency module that directly modulates the extracted identity signals into our diffusion process, which increases the multi-view 3D identity consistency. Extensive experiments verify the effectiveness of our method and show its superiority over strong baselines, both quantitatively and qualitatively. Source code and pre-trained models can be found on our project page: https://mybabyyh.github.io/InstructPix2NeRF.
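The abstract names token position randomization without spelling out the mechanism here. One plausible reading, sketched below purely as an assumption, is that the instruction's tokens are placed at a random offset inside the fixed-length token window during training instead of always starting at position 0. The function name `randomize_token_position` and the `pad_id` value are hypothetical; the default window of 77 matches the context length of CLIP's text encoder.

```python
import random
from typing import List, Optional


def randomize_token_position(
    token_ids: List[int],
    max_len: int = 77,        # CLIP text encoder context length
    pad_id: int = 0,          # hypothetical padding id, chosen for illustration
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Place the instruction tokens at a random offset in a padded window."""
    rng = rng or random.Random()
    if len(token_ids) > max_len:
        raise ValueError("instruction is longer than the token window")
    # Pick where the instruction starts; the rest of the window is padding.
    offset = rng.randint(0, max_len - len(token_ids))
    tail = max_len - offset - len(token_ids)
    return [pad_id] * offset + list(token_ids) + [pad_id] * tail


padded = randomize_token_position([5, 6, 7], max_len=10, rng=random.Random(1))
```

Under this reading, the model cannot rely on an instruction always occupying the first positions, which is one way several instructions could share a single window at inference time.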

Paper Structure (24 sections, 5 equations, 18 figures, 9 tables)

Introduction
Related Work
Data Preparation
Method
Conditional Latent 3D diffusion
Token Position Randomization
Identity Consistency Module
Image and Text Conditioning
Experiments
Conclusions
Appendix
Implementation Details
Experiment setting
Training Dataset
Evaluation
...and 9 more sections

Figures (18)

Figure 1: Our instructed 3D-aware portrait editing model allows users to perform interactive global and local editing with human instructions. An instruction can be a single attribute edit or style edit, several attribute edits together, or even attribute and style edits together.
Figure 2: An overview of our conditional latent 3D diffusion model. We use a NeRF-based generator inversion encoder $E$ to obtain the latent code of the image. Then, we train a diffusion model conditioned on human instructions and the original face. Text instruction conditioning is introduced through cross-attention with the CLIP text embedding, and the original image conditioning is realized via concatenation and adaptive layer norm. $F$ is a face recognition model. $G$ is a NeRF-based generator. The Diffusion Transformer is trainable; the other models are fixed.
Figure 3: Qualitative comparison. Our method satisfies the text instruction while preserving identity consistency, especially for multi-instruction editing.
Figure 4: Visual improvements of identity modulation and regularization loss.
Figure 5: The improvement rate of AA and CLIP score for our model, for different numbers of editing instructions, relative to the model trained without the token position randomization strategy.
...and 13 more figures
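The Figure 2 caption mentions that the original image is injected partly through adaptive layer norm. A minimal sketch of that operation is given below, assuming the DiT-style convention in which the condition predicts a per-feature scale and shift applied to the normalized activations. The function names are hypothetical, and in a real model `scale` and `shift` would come from a learned projection of the conditioning vector rather than being passed in directly.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaptive_layer_norm(
    x: List[float], scale: List[float], shift: List[float]
) -> List[float]:
    """Modulate normalized features with a condition-derived scale and shift.

    Assumed convention: y = (1 + scale) * LayerNorm(x) + shift, so zero scale
    and zero shift reduce to a plain layer norm.
    """
    normed = layer_norm(x)
    return [(1.0 + s) * n + b for n, s, b in zip(normed, scale, shift)]


features = [1.0, 2.0, 3.0, 4.0]
plain = adaptive_layer_norm(features, [0.0] * 4, [0.0] * 4)
shifted = adaptive_layer_norm(features, [0.0] * 4, [1.0] * 4)
```

With zero modulation the layer behaves like an ordinary layer norm, so the conditioning signal only has to encode how the features should deviate from that default.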
