Energy-Guided Optimization for Personalized Image Editing with Pretrained Text-to-Image Diffusion Models

Rui Jiang, Xinghe Fu, Guangcong Zheng, Teng Li, Taiping Yao, Xi Li

TL;DR

This work addresses personalized image editing with pretrained diffusion models by reframing editing as an energy-guided latent optimization problem conditioned on a reference text-image pair. It introduces EGO-Edit, a training-free and inversion-free framework that combines text-energy guidance for global semantic alignment with image-energy guidance for fine-grained appearance, further enhanced by latent-space content composition and a coarse-to-fine timestepping strategy. The method uses a diffusion-model energy $\mathcal{E}$ and a descending sequence of timesteps $t_1>t_2>\dots>t_N$ to progressively refine structure and details, achieving high identity consistency even in cross-class replacements. Extensive experiments on DreamEditBench and PIE-Bench with DINO and CLIP metrics demonstrate state-of-the-art performance, and ablations validate the importance of the IEG, CC, and GS components. This approach offers a practical, scalable solution for personalized editing without retraining diffusion models, with strong potential for real-world content customization.

Abstract

The rapid advancement of pretrained text-driven diffusion models has significantly enriched applications in image generation and editing. However, as the demand for personalized content editing increases, new challenges emerge, especially when dealing with arbitrary objects and complex scenes. Existing methods usually mistake the mask for an object shape prior and therefore struggle to achieve seamless integration. The commonly used inversion noise initialization also hinders identity consistency with the target object. To address these challenges, we propose a novel training-free framework that formulates personalized content editing as the optimization of edited images in the latent space, using diffusion models as energy function guidance conditioned on reference text-image pairs. We propose a coarse-to-fine strategy that employs text energy guidance at the early stage to achieve a natural transition toward the target class and then uses point-to-point, feature-level image energy guidance to perform fine-grained appearance alignment with the target object. Additionally, we introduce latent-space content composition to enhance overall identity consistency with the target. Extensive experiments demonstrate that our method excels in object replacement even with a large domain gap, highlighting its potential for high-quality, personalized image editing.
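
The coarse-to-fine schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the quadratic energies, the function names (`quadratic_energy`, `coarse_to_fine_optimize`), and the parameters `n_text_only` and `step_size` are all assumptions made for illustration. In the actual method the energies would come from a pretrained diffusion model evaluated at each timestep; here they are simple distances to fixed target vectors, and the timestep value only orders the steps.

```python
import numpy as np


def quadratic_energy(z, target):
    """Toy energy 0.5 * ||z - target||^2 (stand-in for a diffusion-model energy)."""
    return 0.5 * float(np.sum((z - target) ** 2))


def quadratic_energy_grad(z, target):
    """Gradient of the toy energy with respect to the latent z."""
    return z - target


def coarse_to_fine_optimize(z0, text_target, image_target, timesteps,
                            n_text_only, step_size=0.1):
    """Gradient descent on a latent over a descending timestep schedule.

    Hypothetical schedule: the first `n_text_only` steps use only the
    text energy (coarse stage); the remaining steps add the image energy
    (fine stage). The toy energies ignore the timestep value itself.
    """
    if any(a <= b for a, b in zip(timesteps, timesteps[1:])):
        raise ValueError("timesteps must be strictly descending")
    z = np.asarray(z0, dtype=float).copy()
    for i, _t in enumerate(timesteps):
        # Text energy guidance is applied at every step.
        grad = quadratic_energy_grad(z, text_target)
        if i >= n_text_only:
            # Image energy guidance is added only in the later, fine stage.
            grad = grad + quadratic_energy_grad(z, image_target)
        z = z - step_size * grad
    return z


# Usage: 50 descending timesteps, text-only guidance for the first 20.
text_target = np.array([1.0, 0.0])
image_target = np.array([1.0, 1.0])
timesteps = list(range(980, -1, -20))
z_final = coarse_to_fine_optimize(np.zeros(2), text_target, image_target,
                                  timesteps, n_text_only=20)
```

With these toy energies, the text-only stage pulls the latent toward `text_target`, and the later stage moves it to the minimizer of the summed energy, which lies midway between the two targets.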

Paper Structure

This paper contains 25 sections, 12 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Comparisons among different methods in personalized content editing. (a) Inpainting-based methods usually require fine-tuning the diffusion model with reference images as the condition. (b) Sampling-based methods initialize the noise with inversion to maintain the background information from the source image. (c) The proposed method iteratively optimizes the latent code to perform training-free and inversion-free editing.
  • Figure 2: Performance overview of the proposed method in image customization editing. Our method generates edited images by integrating contextual guidance with a reference image. The first row demonstrates object replacement within the same category, while the second row shows object replacement across different categories.
  • Figure 3: Pipeline Overview of the Proposed Method: The illustration above outlines the pipeline for our energy-guided optimization method. We construct the energy function derived from the diffusion model, aiming to minimize the energy of the edited image, $\mathbf{\tilde{x}}_t$, to progressively align its distribution with that of the reference image. The diffusion-based energy function is composed of two key components: Text Energy Guidance (TEG) and Image Energy Guidance (IEG). TEG is applied throughout the entire process, ensuring consistent semantic alignment, while IEG is specifically employed during the N2 optimization step to refine visual details, enhancing the fidelity of the edited image to the reference. The processes for both TEG and IEG are detailed below the main pipeline.
  • Figure 4: Visualization of the feature similarity. Given two text-image pairs, referred to as source and target, we query the source image with the target text and compute the feature similarity with the source and target under noise addition at different timesteps $t$. We analyze 140 images from different categories and calculate the mean and standard deviation of the feature similarity.
  • Figure 5: Qualitative results of cross-class replacement. The source object and the target object are sampled from different classes.
  • ...and 4 more figures