Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing

Ekaterina Iakovleva, Fabio Pizzati, Philip Torr, Stéphane Lathuilière

TL;DR

This work tackles the fragility of text-based diffusion editing under ambiguous user prompts by introducing SANE, a zero-shot pipeline that uses a large language model to decompose an ambiguous instruction into a set of specific edits. It then combines these specific instructions with the original prompt inside the diffusion denoising process via a novel noise-aggregation and CFG-based guidance, enabling accurate, interpretable, and diverse edits without model training. The approach yields consistent gains across multiple baselines and datasets, with higher gains as the number of specific instructions increases, and is shown to improve interpretability by exposing the decomposition to users. SANE is broadly applicable to pre-trained instruction-based diffusion models and advances practical image editing by addressing ambiguity in natural language instructions.

Abstract

Text-based editing diffusion models exhibit limited performance when the user's input instruction is ambiguous. To solve this problem, we propose $\textit{Specify ANd Edit}$ (SANE), a zero-shot inference pipeline for diffusion-based editing systems. We use a large language model (LLM) to decompose the input instruction into specific instructions, i.e., well-defined interventions to apply to the input image to satisfy the user's request. We benefit from the LLM-derived instructions alongside the original one, thanks to a novel denoising guidance strategy specifically designed for the task. Our experiments with three baselines and on two datasets demonstrate the benefits of SANE in all setups. Moreover, our pipeline improves the interpretability of editing models and boosts output diversity. We also demonstrate that our approach can be applied to any edit, whether ambiguous or not. Our code is public at https://github.com/fabvio/SANE.

Paper Structure

This paper contains 32 sections, 8 equations, 9 figures, and 8 tables.

Figures (9)

  • Figure 1: Problem definition. Abstract user instructions may lead existing editing diffusion models to failure (top). SANE solves this problem by decomposing input instructions into specific ones, satisfying the user's request by integrating detailed edits in the editing process (bottom). SANE is completely zero-shot, with no training required.
  • Figure 2: SANE inference pipeline. We prompt an LLM to map an ambiguous input instruction $c$ to a set of specific instructions $\mathcal{S}$ (left). We provide a description of $x$ as context, in addition to $c$. Once the instructions in $\mathcal{S}$ are extracted, we use them in addition to $c$ in the denoising loop of an editing diffusion model (right). For each iteration, we estimate the noise in $z_{t}$ by conditioning the diffusion U-Net $f_\theta$ on all instructions. We then combine all specific instructions in a single noise estimation, which we later use in classifier-free guidance (CFG). We update the noisy latent $z_t$ to $z_{t-1}$ following standard approaches. After $T$ iterations, we obtain the output image $\tilde{x}=\mathcal{D}(z_0)$.
  • Figure 3: Qualitative results. Using SANE on top of IP2P helps to respect the ambiguous instruction (underlined in grey) by adding specific elements into the input scene. We show how increasing the number of specific instructions (orange, cyan, purple) adds important details to the scene that the baseline ignores. Examples of such details are the snow on the ground (first row), the herb garnish (second row), and the taxis (third row). Colored boxes in the header indicate the instruction used for each column.
  • Figure 4: Variability. We quantify the variability of generated samples with LPIPS and DreamSim, evaluating the average distance to the input image (plots panel). For both metrics, higher values imply higher variability. Above the "Ours" bars, we report $N$. In the visualization panel, we show the average pixel difference between original and synthesized images for the reported instruction. Using SANE modifies more pixels.
  • Figure 5: t-SNE visualisation of instruction embeddings. We compute embeddings for all instructions from the EMU-Edit dataset and for ambiguous instructions from the IP2P dataset, and we visualise them using t-SNE. We show that embeddings form clusters corresponding to the task types (splits) in EMU-Edit.
  • ...and 4 more figures
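
The denoising step described in the Figure 2 caption (estimate noise under every instruction, merge the specific-instruction estimates into one, then apply CFG) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the source does not state the aggregation rule or the guidance formula, so the plain mean in `aggregate_specific`, the extra guidance term, and the scales `s_image`, `s_text`, and `s_specific` are all assumptions chosen for illustration. The first three terms follow the usual two-scale guidance used by instruction-based editors such as IP2P; the arrays stand in for U-Net outputs.

```python
import numpy as np

def aggregate_specific(eps_specific):
    """Merge per-instruction noise estimates into one array.

    ASSUMPTION: a plain mean. The paper's actual aggregation rule
    is not given in this summary.
    """
    return np.mean(np.stack(eps_specific, axis=0), axis=0)

def guided_noise(eps_uncond, eps_image, eps_ambiguous, eps_specific,
                 s_image=1.5, s_text=7.5, s_specific=3.0):
    """Combine noise estimates with classifier-free guidance.

    eps_uncond    : estimate with no image and no text conditioning
    eps_image     : estimate conditioned on the input image only
    eps_ambiguous : estimate conditioned on image + ambiguous instruction c
    eps_specific  : list of estimates, one per specific instruction in S
    The s_specific term is a hypothetical extension, not the paper's formula.
    """
    eps_s = aggregate_specific(eps_specific)
    return (eps_uncond
            + s_image * (eps_image - eps_uncond)      # image guidance
            + s_text * (eps_ambiguous - eps_image)    # ambiguous instruction
            + s_specific * (eps_s - eps_image))       # aggregated specifics

# Toy example: constant arrays standing in for U-Net noise predictions.
shape = (4, 8, 8)
eps_uncond = np.zeros(shape)
eps_image = np.ones(shape)
eps_ambiguous = np.full(shape, 2.0)
eps_specific = [np.full(shape, 3.0), np.full(shape, 5.0)]

eps_hat = guided_noise(eps_uncond, eps_image, eps_ambiguous, eps_specific)
```

In a real pipeline, `eps_hat` would be passed to the scheduler to move from $z_t$ to $z_{t-1}$; aggregating before guidance keeps that scheduler step the same however many specific instructions there are.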