KV-Edit: Training-Free Image Editing for Precise Background Preservation

Tianrui Zhu, Shiyi Zhang, Jiawei Shao, Yansong Tang

TL;DR

KV-Edit presents a training-free image editing framework that strictly preserves background content by caching background key-value tokens during inversion and reusing them during denoising. By decoupling foreground editing from the background through an attention scheme and leveraging a KV cache, it achieves perfect background preservation while enabling flexible edits guided by user prompts. The approach is complemented by mask-guided inversion and reinitialization options, plus an inversion-free variant that reduces memory complexity to $O(1)$, increasing practicality. Extensive PIE-Bench evaluations and user studies demonstrate superior background preservation and competitive image quality relative to training-free and training-based methods, with strong potential for broader applications such as video editing.

Abstract

Background consistency remains a significant challenge in image editing tasks. Despite extensive developments, existing works still face a trade-off between maintaining similarity to the original image and generating content that aligns with the target. Here, we propose KV-Edit, a training-free approach that uses KV cache in DiTs to maintain background consistency, where background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, ultimately generating new content that seamlessly integrates with the background within user-provided regions. We further explore the memory consumption of the KV cache during editing and optimize the space complexity to $O(1)$ using an inversion-free method. Our approach is compatible with any DiT-based generative model without additional training. Experiments demonstrate that KV-Edit significantly outperforms existing approaches in terms of both background and image quality, even surpassing training-based methods. Project webpage is available at https://xilluill.github.io/projectpages/KV-Edit
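The mechanism described above can be sketched as a single attention call in which only foreground tokens issue queries, while keys and values come from both the current foreground and the cached background. This is a minimal pure-Python sketch, not the authors' implementation: the function names (`attend`, `foreground_attention`), the single-head layout, and the omission of text tokens, multiple layers, and per-timestep caches are all simplifying assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

def foreground_attention(q_fg, k_fg, v_fg, kv_cache):
    """Hypothetical helper: foreground queries attend to foreground tokens
    plus background keys/values cached earlier (assumed to come from
    inversion). Background tokens are never recomputed or updated here."""
    k_bg, v_bg = kv_cache
    return attend(q_fg, k_fg + k_bg, v_fg + v_bg)

# Toy usage: one foreground token, one cached background token.
cache = ([[1.0, 0.0]], [[0.0, 1.0]])  # (background keys, background values)
out = foreground_attention([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]], cache)
print(out)
```

Because the output contains rows only for foreground queries, the background tokens are left untouched by construction; the cache influences the new foreground content without being regenerated itself.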

Paper Structure

This paper contains 20 sections, 8 equations, 11 figures, 4 tables, 3 algorithms.

Figures (11)

  • Figure 1: Overview of our proposed KV-Edit. Given an input image and mask, we separate the image into foreground and background. Here, $\mathbf{x}$ and $\mathbf{z}$ denote intermediate results in inversion and denoising processes respectively. Starting from $\mathbf{x}_0$, we first perform inversion to obtain predicted noise $\mathbf{x}_N$ while caching KV pairs. Then, we choose the input $\mathbf{z}^{fg}_N$ and generate edited foreground content $\mathbf{z}^{fg}_0$ based on a new prompt. Finally, we concatenate it with the original background $\mathbf{x}^{bg}_0$ to obtain the edited image with preserved background.
  • Figure 2: The reconstruction error in the inversion-reconstruction process. Starting from the original image $\mathbf{x}_{t_0}$, the inversion process proceeds to $\mathbf{x}_{t_N}$. During the inversion process, we use the intermediate images $\mathbf{x}_{t_i}$ to reconstruct the original image and calculate the MSE between the reconstructed image $\mathbf{x}_{t_0}^{\prime}$ and the original image $\mathbf{x}_{t_0}$.
  • Figure 3: Analysis of factors affecting background changes. The four images on the right demonstrate how foreground content and condition changes influence the final results.
  • Figure 4: Demonstration of inversion-free KV-Edit. The right panel shows three comparative cases, including a failure case, while the left panel illustrates how the inversion-free approach reduces the space complexity to $O(1)$.
  • Figure 5: Qualitative results on PIE-Bench. Unlike existing methods, our method demonstrates superior performance by strictly maintaining background consistency while following the user's text prompt. The comparison also showcases a user-friendly workflow.
  • ...and 6 more figures
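
The final step in the Figure 1 caption (joining the edited foreground with the original background) and the MSE measure mentioned in the Figure 2 caption can be illustrated together in a toy setting. This is a sketch under stated assumptions rather than the paper's code: flat lists of floats stand in for image or latent tokens, and the names `composite` and `masked_mse` are hypothetical.

```python
def composite(original, edited, mask):
    """Take edited values where mask is True (foreground) and the
    untouched original values elsewhere (background)."""
    return [e if m else o for o, e, m in zip(original, edited, mask)]

def masked_mse(a, b, mask, region=False):
    """Mean squared error over positions whose mask value equals `region`
    (region=False selects the background). Returns 0.0 for an empty region."""
    diffs = [(x - y) ** 2 for x, y, m in zip(a, b, mask) if m == region]
    return sum(diffs) / len(diffs) if diffs else 0.0

# Toy usage: positions 1 and 2 are the user-provided edit region.
original = [0.1, 0.2, 0.3, 0.4]
edited = [0.9, 0.8, 0.7, 0.6]
mask = [False, True, True, False]

result = composite(original, edited, mask)
print(result)
print(masked_mse(result, original, mask))               # background error
print(masked_mse(result, original, mask, region=True))  # foreground change
```

Since background positions are copied rather than regenerated, the background error is exactly zero in this toy setting, which is the property the captions describe as strict background preservation.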