ColorEdit: Training-free Image-Guided Color editing with diffusion model

Xingxi Yin, Zhi Li, Jingfeng Zhang, Chenglin Li, Yin Zhang

TL;DR

This work tackles color editing of objects in diffusion-generated images without training, addressing cross-attention leakage and attribute collision between object features and color prompts. It analyzes how object shape is established in early denoising and leverages AdaIN-based cross-attention Value alignment with a reference color image, complemented by self-attention map replacement and latent blending to preserve background. A training-free framework, ColorEdit, is introduced along with COLORBENCH, a new benchmark for color-change evaluation, demonstrating competitive or superior performance against state-of-the-art editing methods on both synthetic and real images. The approach offers practical, training-free color editing capabilities while outlining limitations for small objects and multi-object edits and pointing to future improvements.

Abstract

Text-to-image (T2I) diffusion models, with their impressive generative capabilities, have been adopted for image editing tasks, demonstrating remarkable efficacy. However, due to attention leakage and collision between the cross-attention map of the object and the new color attribute from the text prompt, text-guided image editing methods may fail to change the color of an object, resulting in a misalignment between the resulting image and the text prompt. In this paper, we conduct an in-depth analysis of the text-guided image synthesis process and of the semantic information that different cross-attention blocks have learned. We observe that the visual representation of an object is determined in the up-block of the diffusion model in the early stage of the denoising process, and that color adjustment can be achieved by aligning the value matrices in the cross-attention layers. Based on our findings, we propose a straightforward yet stable and effective image-guided method to modify the color of an object without requiring any additional fine-tuning or training. Lastly, we present a benchmark dataset called COLORBENCH, the first benchmark to evaluate the performance of color change methods. Extensive experiments validate the effectiveness of our method in object-level color editing and show that it surpasses popular text-guided image editing approaches on both synthesized and real images.
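The value-matrix alignment mentioned above can be illustrated with the standard AdaIN operation, which shifts and rescales one feature tensor so that its per-channel mean and standard deviation match those of a reference. The following is a minimal sketch, not the authors' implementation: the function name `adain`, the `(tokens, channels)` layout, and the choice to compute statistics over the token axis are all assumptions made for illustration.

```python
import numpy as np

def adain(source_v, reference_v, eps=1e-5):
    """Match the per-channel mean/std of source_v to those of reference_v.

    Both inputs are assumed to have shape (tokens, channels); statistics
    are taken over the token axis (an assumption, not from the paper).
    """
    src_mean = source_v.mean(axis=0, keepdims=True)
    src_std = source_v.std(axis=0, keepdims=True)
    ref_mean = reference_v.mean(axis=0, keepdims=True)
    ref_std = reference_v.std(axis=0, keepdims=True)
    # Normalize the source, then re-scale and re-center with reference stats.
    normalized = (source_v - src_mean) / (src_std + eps)
    return normalized * ref_std + ref_mean

# Toy stand-ins for cross-attention Value matrices (hypothetical sizes).
rng = np.random.default_rng(0)
source_v = rng.normal(0.0, 1.0, size=(77, 8))
reference_v = rng.normal(3.0, 2.0, size=(77, 8))
aligned_v = adain(source_v, reference_v)
```

After the call, `aligned_v` keeps the token-wise pattern of `source_v` but carries the channel statistics of `reference_v`, which is the sense in which a reference color image could steer the edited object's color.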

Paper Structure

This paper contains 8 sections, 8 equations, 22 figures, 5 tables, 2 algorithms.

Figures (22)

  • Figure 1: Multi-object color editing. Each output image changes the color of the hat first, and then the color of the bowl and coat, using the associated reference color image.
  • Figure 2: Example of color change. Text-guided editing methods may fail to change the color of an object while preserving its structure or the background.
  • Figure 3: The image-guided color editing framework. Our framework includes: (a) Color image inversion. The reference color image is inverted to initial noise, and the Value matrices of the cross-attention layers are extracted. (b) Source image inversion. The self-attention maps and latents $Z^s_T, \ldots, Z^s_0$ of the source image are extracted. (c) Denoising process, which includes latent blending at the beginning, Value matrix alignment in the cross-attention layers in the early stage, self-attention map replacement throughout the whole process, and background preservation in the last few steps.
  • Figure 4: Cross-attention maps of objects in the decoder of the U-Net. We visualize the average cross-attention maps of various objects across all timesteps. As observed, the shape, contour, and texture of an object are determined in the U-Net decoder.
  • Figure 5: Cross-attention maps of an object at different denoising steps in the decoder of the U-Net. We visualize the cross-attention maps of an object at various diffusion steps within the U-Net decoder.
  • ...and 17 more figures
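
The latent blending and background preservation steps named in the Figure 3 caption are commonly realized as a mask-weighted mix of two latents. The sketch below shows that generic operation only; the function name `blend_latents`, the toy shapes, and the way the mask is built are hypothetical, and the source does not specify how the mask is obtained or at which steps blending is applied.

```python
import numpy as np

def blend_latents(edited, source, mask):
    """Keep edited latents where mask is 1 and source latents where mask is 0."""
    return mask * edited + (1.0 - mask) * source

# Toy latents with shape (channels, height, width); real shapes depend on the model.
source = np.zeros((4, 8, 8))
edited = np.ones((4, 8, 8))

# Hypothetical object mask: 1 inside the object region, 0 in the background.
# The singleton channel axis broadcasts across all latent channels.
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0

blended = blend_latents(edited, source, mask)
```

Here the 4x4 object region takes its values from `edited`, while every background position is copied unchanged from `source`.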