Training-Free Image Editing with Visual Context Integration and Concept Alignment

Rui Song, Guo-Hua Wang, Qing-Guo Chen, Weihua Luo, Tongda Xu, Zhening Liu, Yan Wang, Zehong Lin, Jun Zhang

Abstract

In image editing, it is essential to incorporate a context image to convey the user's precise requirements, such as subject appearance or image style. Existing training-based visual context-aware editing methods incur substantial data-collection effort and training cost. On the other hand, training-free alternatives are typically built on diffusion inversion, which struggles with consistency and flexibility. In this work, we propose VicoEdit, a training-free and inversion-free method that injects the visual context into a pretrained text-prompted editing model. More specifically, VicoEdit transforms the source image directly into the target image based on the visual context, thereby eliminating the inversion step that can lead to deviated trajectories. Moreover, we design a posterior sampling approach guided by concept alignment to enhance editing consistency. Empirical results demonstrate that our training-free method achieves even better editing performance than state-of-the-art training-based models.
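
To make the high-level description above concrete, the following is a minimal sketch, under stated assumptions, of what an inversion-free, visual-context-conditioned editing loop could look like: a FlowEdit-style coupling of a source and a target trajectory (Figure 2 compares against FlowEdit), plus a simple guidance term on the predicted clean latent standing in for concept alignment. The `velocity` interface, the step schedule, the guidance form, and all parameter names are illustrative assumptions rather than the paper's actual implementation.

```python
import torch

# Hypothetical stand-in for a pretrained flow-matching editor: the real model,
# its prompt/context conditioning, and the latent shapes are not specified in
# this excerpt, so `velocity` here is only a placeholder interface.
def velocity(z, t, prompt, context=None):
    return torch.zeros_like(z)

def vico_edit_sketch(z_src, src_prompt, tgt_prompt, context,
                     n_steps=50, t_start=0.98, align_weight=0.2):
    """Rough sketch of an inversion-free edit in the spirit of FlowEdit:
    the target latent is evolved directly from the source latent by
    integrating the difference between target- and source-conditioned
    velocities, so no inversion trajectory is needed. The concept-alignment
    term that nudges the predicted clean latent back toward the source is
    only an illustrative guess, not the paper's actual rule."""
    timesteps = torch.linspace(t_start, 0.0, n_steps + 1)
    z_tar = z_src.clone()
    for i in range(n_steps):
        t, t_next = timesteps[i].item(), timesteps[i + 1].item()
        dt = t_next - t  # negative: integrate from noise level t toward 0

        # Couple the source and target trajectories through shared noise.
        noise = torch.randn_like(z_src)
        z_src_t = (1.0 - t) * z_src + t * noise
        z_tar_t = z_tar + (z_src_t - z_src)  # same noise offset, edited content

        # The velocity difference transports the source toward the target.
        v_src = velocity(z_src_t, t, src_prompt)
        v_tar = velocity(z_tar_t, t, tgt_prompt, context=context)
        delta_v = v_tar - v_src

        # Illustrative concept-alignment guidance: estimate the clean latent
        # from the target branch and pull it toward the source content.
        z0_hat = z_tar_t - t * v_tar
        delta_v = delta_v + align_weight * (z_src - z0_hat)

        z_tar = z_tar + dt * delta_v
    return z_tar
```

The point the sketch tries to convey is that the source latent is never inverted back to noise; the target latent is transported directly from the source by the velocity difference, which is what makes the procedure inversion-free.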

Paper Structure

This paper contains 32 sections, 32 equations, 10 figures, 6 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Results of the proposed VicoEdit. The left column of each image pair shows the source and context images, while the right column presents the editing result.
  • Figure 2: The left figure shows the pipeline of FlowEdit. The middle figure illustrates the latent vectors, velocity fields, and sampling trajectory of VicoEdit. The right figure shows the procedure of each sampling step of VicoEdit.
  • Figure 3: Visualization of $\boldsymbol{z}^{tar}_t$ at different timesteps. We visualize the latents from two different trajectories, where the timesteps for starting sampling (i.e., $t_{n_\text{max}}$) are $0.93$ and $0.98$, respectively. The model is instructed to replace the bear with the sloth. The visualization verifies that global features are generated at early steps, and skipping the early stage fails to alter the subject appearance.
  • Figure 4: Editing results with or without concept alignment. Concept alignment preserves details in the source image.
  • Figure 5: Visualization of $\boldsymbol{z}_t$ and $\hat{\boldsymbol{z}}_0$ at different timesteps. Concept alignment guidance accurately predicts $\boldsymbol{z}_0$ even at early timesteps (e.g., when $t=0.9$); a one-step clean-latent estimate is sketched after this list.
  • ...and 5 more figures
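
As background for the $\hat{\boldsymbol{z}}_0$ visualizations in Figures 3 and 5: assuming the common rectified-flow parameterization $\boldsymbol{z}_t = (1-t)\,\boldsymbol{z}_0 + t\,\boldsymbol{\epsilon}$ with a velocity model $\boldsymbol{v}_\theta(\boldsymbol{z}_t, t) \approx \boldsymbol{\epsilon} - \boldsymbol{z}_0$ (the paper's exact convention is not reproduced in this excerpt), a one-step clean-latent estimate is available at any timestep:

$$\hat{\boldsymbol{z}}_0 = \boldsymbol{z}_t - t\,\boldsymbol{v}_\theta(\boldsymbol{z}_t, t),$$

which is the kind of quantity a concept-alignment term can act on during sampling.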