LEDITS++: Limitless Image Editing using Text-to-Image Models

Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos

TL;DR

The novel inversion approach of LEDITS++ requires neither tuning nor optimization and produces high-fidelity results in a few diffusion steps. The method also supports multiple simultaneous edits and is architecture-agnostic.

Abstract

Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from text inputs alone. Subsequent research efforts aim to apply these capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They require time-consuming fine-tuning, deviate more strongly than necessary from the input image, and/or lack support for multiple simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. First, the novel inversion approach of LEDITS++ requires neither tuning nor optimization and produces high-fidelity results in a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods.

Paper Structure (51 sections, 19 equations, 22 figures, 2 tables)

Figures (22)

  • Figure 1: LEDITS++ facilitates versatile image-to-image editing. Several complex cases are available now.
  • Figure 2: Comparison of image editing methods. (top) LEDITS++ is the only method to restrict edits to the tree leaves and position of the car. (bottom) Ours is the only approach faithfully executing all three edits and keeping changes minimal. (Best viewed in color)
  • Figure 3: Exemplary edit performed with LEDITS++ in only 25 diffusion steps with SD1.5. We apply a complex, compounded edit and ground each edit to a semantically reasonable image region.
  • Figure 4: Semantic segmentation quality of LEDITS++. We show the intersection over union (higher is better) for COCO panoptic segmentation. The intersection masks outperform each individual mask by a clear margin and come close to the CLIPSeg reference. (Best viewed in color)
  • Figure 5: Comparison of the trade-off between instruction alignment and image similarity for different editing methods. Results are reported for simultaneous manipulation of three facial attributes on CelebA. We plot CLIP scores (higher is better) of the target attributes against LPIPS similarity (lower is better). LEDITS++ clearly outperforms all competing methods. (Best viewed in color)
  • ...and 17 more figures
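
Figure 4 reports intersection over union (IoU) for masks. As a reminder of what that metric measures, here is a minimal pure-Python sketch. It is not the authors' evaluation code: the helper names `intersect` and `iou` and the 3x3 toy masks are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 values, all rows the same length


def intersect(a: Mask, b: Mask) -> List[List[int]]:
    """Pixel-wise AND of two binary masks of identical shape."""
    return [[int(bool(x) and bool(y)) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union: |pred AND target| / |pred OR target|."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    # Two empty masks agree perfectly; this convention is an assumption.
    return inter / union if union else 1.0


# Hypothetical toy data: two coarse masks that each over-cover the target.
mask_a = [[1, 1, 0],
          [1, 1, 0],
          [0, 0, 0]]
mask_b = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
target = [[0, 1, 0],
          [0, 1, 0],
          [0, 0, 0]]

combined = intersect(mask_a, mask_b)
score_a = iou(mask_a, target)
score_b = iou(mask_b, target)
score_combined = iou(combined, target)
```

In this toy case each coarse mask covers twice the target area, while their intersection matches the target exactly, so the intersected mask scores higher than either input. Real masks will not generally behave this cleanly.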