Concept Lancet: Image Editing with Compositional Representation Transplant

Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Hancheng Min, Chris Callison-Burch, René Vidal

TL;DR

CoLan addresses a core difficulty in diffusion-based image editing: the appropriate edit strength depends on how strongly a concept is present in each source image. It builds a compositional latent dictionary from a curated concept dataset (CoLan-150K) and uses sparse decomposition to estimate edit magnitudes. At inference, the source latent v is expressed as a sparse combination of concept vectors, v = D w* + r, where D is the concept dictionary, w* the sparse coefficients, and r the residual. The target concept is then transplanted by replacement (insertion and removal are special cases): with D' denoting the dictionary after the source concept's atom is swapped for the target's, the edited latent is v' = D' w* + r. Editing backbones equipped with CoLan reach state-of-the-art editing effectiveness and consistency preservation, and the method remains plug-and-play and zero-shot. The result is a scalable route to controllable, high-fidelity, concept-aware image editing across diverse scenes.

Abstract

Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
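The decompose-then-transplant step described above can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the dictionary `D` and the latents are random vectors standing in for real concept representations, the sparse solver is a plain ISTA loop on a lasso objective (chosen here only for illustration), and the names `sparse_decompose`, `transplant`, `lam`, and `n_iter` are hypothetical.

```python
import numpy as np


def sparse_decompose(v, D, lam=0.05, n_iter=500):
    """Estimate sparse coefficients w such that v is approximately D @ w.

    Runs ISTA on 0.5 * ||v - D w||^2 + lam * ||w||_1. The lasso objective
    and its hyperparameters are illustrative assumptions.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = w - step * (D.T @ (D @ w - v))  # gradient step on the quadratic term
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return w


def transplant(v, D, w, src_idx, target_atom):
    """Replace the source concept's atom with the target's: v' = D' w + r."""
    r = v - D @ w                    # residual not explained by the dictionary
    D_new = D.copy()
    D_new[:, src_idx] = target_atom  # swap a single column (atom)
    return D_new @ w + r


# Toy data: 8 unit-norm "concept" atoms in a 64-dimensional latent space.
rng = np.random.default_rng(0)
dim, n_concepts = 64, 8
D = rng.standard_normal((dim, n_concepts))
D /= np.linalg.norm(D, axis=0)

target_atom = rng.standard_normal(dim)
target_atom /= np.linalg.norm(target_atom)

# Source latent: mostly concept 0, some concept 3, plus a little noise.
v = 0.8 * D[:, 0] + 0.3 * D[:, 3] + 0.01 * rng.standard_normal(dim)

w = sparse_decompose(v, D)
v_edit = transplant(v, D, w, src_idx=0, target_atom=target_atom)
```

Because only one column of the dictionary changes, the edit moves the latent by `w[0] * (target_atom - D[:, 0])`. The edit strength is therefore set by the estimated presence of the source concept in this particular input rather than by a hand-tuned scalar.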

Paper Structure

This paper contains 19 sections, 6 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Given a source image and the editing task, our proposed CoLan generates a concept dictionary and performs sparse decomposition in the latent space to precisely transplant the target concept.
  • Figure 2: Representation manipulation in diffusion models involves adding an accurate magnitude of edit direction (e.g., Image (3) by CoLan) to the latent source representation. Figure \ref{fig:visual_comparison_p2pzero} and Figure \ref{fig:visual_comparison_infedit} show more examples.
  • Figure 3: The CoLan framework. Starting with a source image and prompt, a vision-language model extracts visual concepts (e.g., cat, grass, sitting) to construct a concept dictionary. The source representation is then decomposed along this dictionary, and the target concept (dog) is transplanted to replace the corresponding atom to achieve precise edits. Finally, the image editing backbone generates an edited image where the desired target concept is incorporated without disrupting other visual elements.
  • Figure 4: Samples of the concept stimuli from CoLan-150K. Additional samples are attached in the Appendix §\ref{sec:appendix_additional_results}.
  • Figure 5: Visual comparisons of CoLan in the text embedding space of P2P-Zero. Texts in gray are the original captions of the source images from PIE-Bench, and texts in blue are the corresponding edit task (replace, add, remove). [x] represents the concepts of interest, and [] represents the null concept.
  • ...and 12 more figures