SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow

Kenan Tang, Yanhong Li, Yao Qin

TL;DR

SPICE tackles the challenge of following detailed, multi-step editing prompts and maintaining high image fidelity across many edits by delivering a training-free workflow built around three components: context-rich mask generation with context dots and soft inpainting, color/edge hints through a hinted image, and a two-stage denoising process that couples a Canny edge ControlNet with a base diffusion model. This combination enables precise, repeatable edits over more than 100 steps while keeping unedited regions intact and supporting arbitrary resolutions. Empirical results on challenging benchmarks show SPICE outperforming strong baselines both quantitatively (CLIP-based metrics) and qualitatively (human judgments), with ablations confirming the necessity of each component. The approach is designed for easy integration into popular diffusion-model UIs and emphasizes accessibility and efficiency, potentially broadening practical use for researchers and artists alike while acknowledging responsible deployment considerations.
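The two-stage denoising idea can be illustrated with a small scheduling sketch: the early steps are assigned to the Canny edge ControlNet model and the remaining steps to the base model. This is only a minimal sketch under stated assumptions; the function name `split_denoising_steps` and the `switch_fraction` parameter are hypothetical, and the summary does not state where SPICE actually switches between the two models.

```python
from typing import List, Tuple


def split_denoising_steps(
    total_steps: int, switch_fraction: float
) -> Tuple[List[int], List[int]]:
    """Partition step indices into an early and a late stage.

    ASSUMPTION: `switch_fraction` is a hypothetical knob for illustration;
    the source does not give the actual switch point used by SPICE.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must lie in [0, 1]")

    # Index of the first step handled by the base model.
    switch_at = int(total_steps * switch_fraction)

    early_steps = list(range(0, switch_at))           # Canny ControlNet stage
    late_steps = list(range(switch_at, total_steps))  # base-model stage
    return early_steps, late_steps


# Example: 20 denoising steps, switching halfway (an arbitrary choice).
early, late = split_denoising_steps(total_steps=20, switch_fraction=0.5)
```

In an actual pipeline, each index in `early` would be denoised with the edge-conditioned model and each index in `late` with the base model; the sketch only shows how the steps are divided.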

Abstract

Prompt-based models have demonstrated impressive prompt-following capability at image editing tasks. However, the models still struggle with following detailed editing prompts or performing local edits. Specifically, global image quality often deteriorates immediately after a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and consistently improves image quality during more than 100 editing steps, while keeping the unedited regions intact. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.

Paper Structure

This paper contains 68 sections, 3 equations, 27 figures, 3 tables.

Figures (27)

  • Figure 1: SPICE enables a user to edit the image exactly as they want, while image details outside the edited region remain strictly intact after many editing steps. The first row shows the full image at a resolution of 3000$\times$2000. The second row shows a 900$\times$600 region enlarged for better visibility. In this example, the user performs various edits in 9 editing steps, including structure change, object removal, object addition, object replacement, text addition, color change, and detail fixes. Steps 7, 8, and 9 together fix the fridge structure. The labels above the first row are abbreviated versions of the actual editing instructions (more details in \ref{app:hints-and-prompts-fridge}). For example, in the "Add a Word" column, the user wants to add the specific word "suspicious" to the white bowl. The prompt is "An open fridge with food in it. A bowl with a word 'suspicious' on it." The result matches the user's requirement.
  • Figure 2: By sketching a binary mask with context dots and a color & edge hint, users can effortlessly achieve realistic edits with SPICE. Subfigure (a) shows an overview of our workflow, while Subfigures (b) and (c) show the internal steps. In this example, the user wants a sunhat to be added next to the woman. First, the user sketches both a mask with context dots and a hinted image containing color and edge hints. The mask is automatically blurred after being sketched. Then, during the two-stage denoising step, the Canny and base models perform the early and late denoising steps, respectively. See \ref{fig:simple-editing} for more examples of masks and hints.
  • Figure 3: Our workflow outperforms baseline methods in 6 editing categories from EditEval. Each group of five images shows an example from one editing task. Each group starts with the original image, followed by the results of IP2P, MagicBrush (MB), UltraEdit (UE), and our workflow.
  • Figure 4: SPICE can generate content that DALL·E 3 and GPT-4o cannot. Two examples are a rabbit with 4 ears and a violin without a bridge. For individual objects, DALL·E 3 fails even after the user asks the model to edit the errors multiple times. For combined objects, GPT-4o cannot generate all objects correctly at once. However, SPICE can reliably generate these challenging objects.
  • Figure 5: Our workflow outperforms baseline methods in 6 editing categories from EditEval. Each group of five images shows an example from one editing task. Each group starts with the original image, followed by the results of IP2P, MagicBrush (MB), UltraEdit (UE), and our workflow.
  • ...and 22 more figures
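The Figure 2 caption notes that the sketched binary mask is automatically blurred. The sketch below illustrates, in simplified pixel space, why a blurred mask gives a soft transition while leaving distant pixels untouched. It is not SPICE's implementation: the helper names `box_blur` and `soft_blend`, the box filter, and the blur radius are all assumptions chosen for illustration, and the soft inpainting in the actual workflow operates inside the diffusion process rather than as a final pixel blend.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of floats in [0, 1]


def box_blur(mask: Image, radius: int) -> Image:
    """Blur a mask with a simple box filter (ASSUMED filter, for illustration)."""
    height, width = len(mask), len(mask[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            # Average over the window, clipped at the image border.
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += mask[yy][xx]
                    count += 1
            blurred[y][x] = total / count
    return blurred


def soft_blend(original: Image, edited: Image, soft_mask: Image) -> Image:
    """Per-pixel blend: mask 1 keeps the edit, mask 0 keeps the original."""
    return [
        [m * e + (1.0 - m) * o for o, e, m in zip(o_row, e_row, m_row)]
        for o_row, e_row, m_row in zip(original, edited, soft_mask)
    ]


# Toy 7x7 example: black original, white "edit", 3x3 binary mask in the center.
size = 7
original = [[0.0] * size for _ in range(size)]
edited = [[1.0] * size for _ in range(size)]
binary_mask = [
    [1.0 if 2 <= y <= 4 and 2 <= x <= 4 else 0.0 for x in range(size)]
    for y in range(size)
]

soft_mask = box_blur(binary_mask, radius=1)
blended = soft_blend(original, edited, soft_mask)
```

In this toy example, pixels outside the blur footprint keep the original value, the mask center takes the edited value, and pixels at the mask boundary receive a mixture of the two.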