Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models

Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon

TL;DR

This work addresses a weakness of text-guided image editing with text-to-image diffusion models: a text prompt alone often cannot pin down the exact concept the user has in mind. Custom-Edit is a two-step approach that first customizes a diffusion model on a few reference images, training only language-relevant parameters with augmented prompts, and then performs text-guided editing with Prompt-to-Prompt (P2P) and Null-text Inversion. The key idea is to optimize only the language-relevant components (the cross-attention key and value weights and the embedding of a rare token) while applying a prior-preservation loss to maintain general language grounding. The authors provide a concrete recipe for customization and editing, compare against customization baselines such as Textual Inversion and DreamBooth, and demonstrate improved reference fidelity with preserved source structure across multiple datasets and editing strengths. The work offers a practical path to high-fidelity, reference-guided edits from only a few reference images, and it discusses trade-offs, limitations, and future extensions to more capable text encoders and grounding-enabled models.
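The customization step summarized above can be illustrated with a minimal, framework-free sketch. It is not the authors' code. It assumes parameter names follow the Hugging Face diffusers UNet convention, in which `attn2` marks a cross-attention layer and `to_k` / `to_v` are its key and value projections; the helper names `select_language_relevant` and `customization_loss` and the `prior_weight` argument are hypothetical, chosen only for illustration. The rare-token embedding, which is also trained, lives in the text encoder and would be handled separately.

```python
from typing import Iterable, List, Sequence

# Assumption: diffusers-style names, where "attn2" is cross-attention and
# "to_k"/"to_v" are the key/value projections that read the text embedding.
CROSS_ATTN_KV_MARKERS = ("attn2.to_k", "attn2.to_v")


def select_language_relevant(param_names: Iterable[str]) -> List[str]:
    """Keep only cross-attention key/value parameters; all others stay frozen."""
    return [
        name
        for name in param_names
        if any(marker in name for marker in CROSS_ATTN_KV_MARKERS)
    ]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def customization_loss(
    ref_pred: Sequence[float],
    ref_noise: Sequence[float],
    prior_pred: Sequence[float],
    prior_noise: Sequence[float],
    prior_weight: float = 1.0,  # hypothetical default, not taken from the paper
) -> float:
    """Denoising loss on reference images plus a weighted prior-preservation term."""
    return mse(ref_pred, ref_noise) + prior_weight * mse(prior_pred, prior_noise)


if __name__ == "__main__":
    block = "down_blocks.0.attentions.0.transformer_blocks.0"
    names = [
        f"{block}.attn1.to_k.weight",  # self-attention: frozen
        f"{block}.attn2.to_q.weight",  # cross-attention query: frozen
        f"{block}.attn2.to_k.weight",  # cross-attention key: trained
        f"{block}.attn2.to_v.weight",  # cross-attention value: trained
        f"{block}.ff.net.0.proj.weight",  # feed-forward: frozen
    ]
    for name in select_language_relevant(names):
        print("trainable:", name)
```

In a real training loop the two loss terms would be computed on noise-prediction tensors for a reference batch and a prior (class-image) batch; plain float sequences are used here only to keep the sketch self-contained.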

Abstract

Text-to-image diffusion models can generate diverse, high-fidelity images based on user-provided text prompts. Recent research has extended these models to support text-guided image editing. While text guidance is an intuitive editing interface for users, it often fails to ensure the precise concept conveyed by users. To address this issue, we propose Custom-Edit, in which we (i) customize a diffusion model with a few reference images and then (ii) perform text-guided editing. Our key discovery is that customizing only language-relevant parameters with augmented prompts improves reference similarity significantly while maintaining source similarity. Moreover, we provide our recipe for each customization and editing process. We compare popular customization methods and validate our findings on two editing methods using various datasets.

Paper Structure (20 sections, 1 equation, 11 figures)

Figures (11)

  • Figure 1: Our Custom-Edit allows high-fidelity text-guided editing, given a few references. Edited images with BLIP2 [li2023blip] captions show the limitation of textual guidance in capturing the fine-grained appearance of the reference.
  • Figure 2: Our Custom-Edit consists of two processes: the customization process and the editing process. (a) Customization. We customize a diffusion model by optimizing only language-relevant parameters (i.e., custom embedding V* and attention weights) on a given set of reference images. We also apply the prior preservation loss to alleviate the language drift. (b) Editing. We then transform the source image to the output using the customized word. We leverage the P2P and Null-text inversion methods [hertz2022prompt, mokady2022null] for this process.
  • Figure 3: Custom-Edit results. Our method transfers the reference's appearance to the source image with unprecedented fidelity. The structures of the source are well preserved. We obtain source prompts using BLIP2 [li2023blip]. Except for the pencil drawing example, we use local editing of P2P with automatically generated masks.
  • Figure 4: Source-Reference Trade-Off. Custom-Diffusion shows the best trade-off, indicating the effectiveness of training only language-relevant parameters. We show qualitative comparisons and samples with various strengths in the appendix.
  • Figure A: Additional Custom-Edit results.
  • ...and 6 more figures