Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon
TL;DR
This work addresses the difficulty of conveying a precise, user-specific concept when editing images with text-to-image diffusion models. It introduces Custom-Edit, a two-step approach that first customizes a model on a few reference images using language-relevant parameters and augmented prompts, then performs text-guided editing with Prompt-to-Prompt (P2P) or Null-text Inversion. The key idea is to optimize only language-relevant components (cross-attention keys and values, plus a rare token) while applying prior preservation to maintain general language grounding. The authors provide a concrete recipe for customization and editing, compare against baselines such as Textual Inversion and DreamBooth, and report improved reference fidelity with preserved source structure across multiple datasets and editing strengths. The work offers a practical path to high-fidelity, reference-guided edits from only a few reference images, and discusses trade-offs, limitations, and possible extensions to more capable text encoders and grounding-enabled models.
Abstract
Text-to-image diffusion models can generate diverse, high-fidelity images based on user-provided text prompts. Recent research has extended these models to support text-guided image editing. While text guidance is an intuitive editing interface for users, it often fails to capture the precise concept the user intends. To address this issue, we propose Custom-Edit, in which we (i) customize a diffusion model with a few reference images and then (ii) perform text-guided editing. Our key discovery is that customizing only language-relevant parameters with augmented prompts improves reference similarity significantly while maintaining source similarity. Moreover, we provide our recipes for both the customization and editing processes. We compare popular customization methods and validate our findings on two editing methods using various datasets.
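
As a rough illustration of what "customizing only language-relevant parameters" could look like in practice, the sketch below filters a list of parameter names down to the cross-attention key and value projections and combines a reference loss with a prior-preservation term. It is not the authors' implementation. The naming pattern (`attn2.to_k`, `attn2.to_v`) assumes the Hugging Face diffusers convention for the Stable Diffusion U-Net, and `select_trainable`, `total_loss`, `prior_weight`, and the example parameter names are hypothetical. The rare-token embedding, which lives in the text encoder, would be handled separately and is not shown.

```python
from typing import Iterable, List

# Assumption: names follow the diffusers Stable Diffusion U-Net convention,
# where "attn2" marks a cross-attention layer and "to_k" / "to_v" are its
# key and value projections ("attn1" is self-attention and is left frozen).
CROSS_ATTN_KV_SUFFIXES = ("attn2.to_k.weight", "attn2.to_v.weight")


def select_trainable(param_names: Iterable[str]) -> List[str]:
    """Keep only cross-attention key/value projection weights."""
    return [name for name in param_names if name.endswith(CROSS_ATTN_KV_SUFFIXES)]


def total_loss(reference_loss: float, prior_loss: float,
               prior_weight: float = 1.0) -> float:
    """Add a weighted prior-preservation term to the reference-image loss.

    prior_weight is a hypothetical knob, not a value taken from the paper.
    """
    return reference_loss + prior_weight * prior_loss


# Hypothetical parameter names, for illustration only.
names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",
    "down_blocks.0.resnets.0.conv1.weight",
]
trainable = select_trainable(names)
```

In a real training loop, one would freeze every parameter whose name is not in `trainable` and pass only the selected ones to the optimizer, so that the self-attention and convolutional weights keep their pretrained values.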
