SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow
Kenan Tang, Yanhong Li, Yao Qin
TL;DR
SPICE is a training-free workflow for following detailed, multi-step editing prompts while maintaining high image fidelity across many edits. It is built around three components: context-rich mask generation with context dots and soft inpainting, color and edge hints supplied through a hinted image, and a two-stage denoising process that couples a Canny edge ControlNet with a base diffusion model. This combination enables precise, repeatable edits over more than 100 steps while keeping unedited regions intact and supporting arbitrary resolutions. On challenging benchmarks, SPICE outperforms strong baselines both quantitatively (CLIP-based metrics) and qualitatively (human judgments), and ablations confirm that each component is necessary. The workflow is designed for easy integration into popular diffusion-model UIs and emphasizes accessibility and efficiency, which may broaden practical use by researchers and artists alike, while acknowledging responsible-deployment considerations.
Abstract
Prompt-based models have demonstrated impressive prompt-following capability in image editing tasks. However, the models still struggle with following detailed editing prompts or performing local edits. Specifically, global image quality often deteriorates immediately after a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and consistently improves image quality over more than 100 editing steps, while keeping the unedited regions intact. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.
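One way to picture how a soft mask can keep unedited regions intact is plain per-pixel compositing. The sketch below is a minimal illustration under stated assumptions, not the SPICE implementation: `feather_mask` and `composite` are hypothetical helper names, images are small grayscale grids of floats in [0, 1], and the softening is a simple box blur applied in pixel space, whereas a diffusion workflow may apply its mask as part of denoising.

```python
from typing import List

# Assumption: a grayscale image is a list of rows of floats in [0, 1].
Image = List[List[float]]


def feather_mask(mask: Image, radius: int = 1) -> Image:
    """Soften a binary mask with a box blur so the edit fades out at its border."""
    h, w = len(mask), len(mask[0])
    soft = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            # Average the mask over the in-bounds neighbourhood of (y, x).
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += mask[ny][nx]
                        count += 1
            soft[y][x] = total / count
    return soft


def composite(original: Image, edited: Image, soft_mask: Image) -> Image:
    """Blend per pixel: mask 1 takes the edit, mask 0 keeps the original."""
    return [
        [m * e + (1.0 - m) * o for o, e, m in zip(orow, erow, mrow)]
        for orow, erow, mrow in zip(original, edited, soft_mask)
    ]


if __name__ == "__main__":
    original = [[0.2] * 5 for _ in range(5)]
    edited = [[0.8] * 5 for _ in range(5)]
    mask = [[0.0] * 5 for _ in range(5)]
    mask[2][2] = 1.0  # request an edit at the centre pixel only

    result = composite(original, edited, feather_mask(mask, radius=1))
    print(result[0][0], result[2][2])
```

In this toy example, pixels farther than `radius` from the masked centre have a mask value of exactly 0 and are returned unchanged, while pixels near the centre receive a partial blend of the edited and original values.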
