LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing

Aoyang Liu, Qingnan Fan, Shuai Qin, Hong Gu, Yansong Tang

TL;DR

This work tackles non-rigid image editing while preserving subject identity by learning a personalized identity prior from only a few reference images. It introduces the two-stage LIPE framework: (1) data-augmented learning of a subject-specific prior by fine-tuning a diffusion model, with updates limited to the attention layers, and (2) a non-rigid editing mechanism called NIMA that uses identity-aware cross-attention masks to guide latent blending during denoising. The authors also present LIPE, a dedicated dataset spanning objects, animals, and humans, and demonstrate through qualitative and quantitative evaluations that LIPE outperforms strong baselines in identity preservation, background fidelity, and prompt alignment for non-rigid edits. The approach offers a practical path toward controllable, identity-consistent image editing with minimal data on the target subject.

Abstract

Although recent years have witnessed significant advancements in image editing thanks to the remarkable progress of text-to-image diffusion models, the problem of non-rigid image editing still presents its complexities and challenges. Existing methods often fail to achieve consistent results due to the absence of unique identity characteristics. Thus, learning a personalized identity prior might help with consistency in the edited results. In this paper, we explore a novel task: learning the personalized identity prior for text-based non-rigid image editing. To address the problems in jointly learning prior and editing the image, we present LIPE, a two-stage framework designed to customize the generative model utilizing a limited set of images of the same subject, and subsequently employ the model with learned prior for non-rigid image editing. Experimental results demonstrate the advantages of our approach in various editing scenarios over past related leading methods in qualitative and quantitative ways.

Paper Structure (40 sections, 11 equations, 11 figures, 5 tables, 1 algorithm)

This paper contains 40 sections, 11 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Given a few reference images of the same identity, our framework learns a personalized identity prior and applies diverse non-rigid image editing for a test image guided by a textual description, leading to high identity-preserved edited results.
  • Figure 2: The pipeline for data augmentation in learning personalized identity prior. (a) We make detailed editing-oriented captions for reference images by harnessing the large language and vision assistant. (b) We leverage the GPT-4 and pre-trained T2I model to generate diverse editing-oriented text-image pairs for the subject's class, which serves as the regularization dataset.
  • Figure 3: Illustration of Non-rigid Image editing via identity-aware MAsk blend (NIMA). (a) Given a test image, we first invert it to obtain the inverted latents $\{x_i\}$ for image reconstruction, to further obtain the subject mask $M^s$ for the source image. (b) Afterward, to achieve non-rigid image editing, we generate the target image by blending the source $x_t$ and target $\hat{x}_T$ information with the generated masks ($M^s$, $M_t^e$).
  • Figure 4: Identity-aware attention map.
  • Figure 5: Comparisons with previous work on general objects. The red font highlights the editing directions. Left to right: reference images, test image, Imagic [kawar2023imagic], MasaCtrl [cao2023masactrl], DreamCtrl, and our method.
  • ...and 6 more figures
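
The mask-guided blending mentioned in the TL;DR and in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `union_mask` and `blend_latents` are hypothetical, latents are flattened to plain lists of floats for readability, and combining the subject mask $M^s$ with the edit mask $M_t^e$ by an element-wise union is an assumption made here for illustration only.

```python
from typing import List, Sequence


def union_mask(subject_mask: Sequence[float], edit_mask: Sequence[float]) -> List[float]:
    """Combine two masks element-wise (assumption: union via max)."""
    if len(subject_mask) != len(edit_mask):
        raise ValueError("masks must have the same length")
    return [max(a, b) for a, b in zip(subject_mask, edit_mask)]


def blend_latents(
    source: Sequence[float],
    target: Sequence[float],
    mask: Sequence[float],
) -> List[float]:
    """Blend two latents: mask=1 keeps the target value, mask=0 keeps the source value."""
    if not (len(source) == len(target) == len(mask)):
        raise ValueError("source, target and mask must have the same length")
    return [m * t + (1.0 - m) * s for s, t, m in zip(source, target, mask)]


# Toy example with four latent positions (hypothetical values).
source_latent = [1.0, 1.0, 1.0, 1.0]   # stands in for the inverted source latent
target_latent = [5.0, 5.0, 5.0, 5.0]   # stands in for the edited target latent
subject_mask = [0.0, 1.0, 0.0, 0.0]    # where the subject is in the source image
edit_mask = [0.0, 0.0, 1.0, 0.0]       # where the subject is in the edited image

mask = union_mask(subject_mask, edit_mask)
blended = blend_latents(source_latent, target_latent, mask)
print(mask)
print(blended)
```

In this toy setting, positions covered by either mask take the target latent, and every other position keeps the source latent, which is the intuition behind preserving the background while letting the subject change pose.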