ZONE: Zero-Shot Instruction-Guided Local Editing

Shanglin Li, Bohan Zeng, Yutang Feng, Sicheng Gao, Xuhui Liu, Jiaming Liu, Li Lin, Xu Tang, Yao Hu, Jianzhuang Liu, Baochang Zhang

TL;DR

ZONE is a zero-shot, instruction-guided local editing framework that locates and edits image regions by combining InstructPix2Pix (IP2P) with cross-attention analysis. A Region-IoU scheme with SAM refines the editing mask, and an FFT-based edge smoother blends the edited layer into the image so that non-edited regions are preserved. The method supports single- and multi-turn edits with minimal user input and reports better fidelity, locality, and stability than state-of-the-art baselines on real and synthetic data. The result is a practical, user-friendly approach to precise region editing in complex images. The work also discusses limitations and societal impact, including safeguards against potential misuse.

Abstract

Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing. However, most existing text-to-image editing methods encounter two obstacles. First, the text prompt needs to be carefully crafted to achieve good results, which is neither intuitive nor user-friendly. Second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE. We first convert the editing intent from the user-provided instruction (e.g., "make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segmentation model. We further develop an edge smoother based on FFT for seamless blending between the layer and the image. Our method allows for arbitrary manipulation of a specific region with a single instruction while preserving the rest. Extensive experiments demonstrate that ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods. Code is available at https://github.com/lsl001006/ZONE.
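
The two post-processing ideas named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`iou`, `select_segment`, `fft_soften`, `blend`), the Gaussian low-pass filter, and the `cutoff` value are all assumptions made for illustration, and the candidate masks are synthetic arrays standing in for the output of a segmentation model such as SAM. The sketch picks the candidate segment that overlaps a coarse edit mask best (by intersection-over-union), softens that mask's edge in the frequency domain, and alpha-blends the edited layer onto the original image.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two binary masks of equal shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def select_segment(coarse_mask, candidates):
    """Return (index, score) of the candidate mask with the highest IoU.

    Hypothetical helper: in practice the candidates would come from a
    segmentation model; here they are plain arrays.
    """
    scores = [iou(coarse_mask, c) for c in candidates]
    best = int(np.argmax(scores))
    return best, scores[best]

def fft_soften(mask, cutoff=0.1):
    """Low-pass filter a binary mask with the FFT to get a soft alpha map.

    Assumption: a Gaussian low-pass filter; `cutoff` is in cycles per pixel.
    FFT filtering is circular, so masks touching the border wrap around.
    """
    h, w = mask.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    lowpass = np.exp(-((radius / cutoff) ** 2))
    spectrum = np.fft.fft2(mask.astype(np.float64))
    soft = np.real(np.fft.ifft2(spectrum * lowpass))
    return np.clip(soft, 0.0, 1.0)

def blend(original, edited, alpha):
    """Alpha-blend an edited H x W x C layer onto the original image."""
    a = alpha[..., None]
    return a * edited + (1.0 - a) * original
```

With a soft alpha map, pixels far from the selected segment keep their original values, while pixels inside it take the edited values and the boundary transitions gradually. A real pipeline would likely pad the image before filtering to avoid the wrap-around noted in the comments.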

Paper Structure (45 sections, 9 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: We propose ZONE, a zero-shot instruction-guided local editing approach. Our key idea is to edit and locate precise editing regions in an image with intuitive textual instructions. We demonstrate a multi-turn editing example in (a) and compare the difference maps between the edited image and the original image in (b) to highlight our method's ability for local editing.
  • Figure 2: Overview of ZONE. (a) Three modules in ZONE. (b) The distinct difference between description-guided and instruction-guided diffusion models on cross-attention. The former usually follows a token-aware format, while the latter is edit-aware. $\varnothing$ denotes the unconditional embeddings for null input. (c) Implementation details of the modules shown in (a).
  • Figure 3: Cross-attention map difference. We average the cross-attention maps among all timesteps for each sample. IP2P shows consistency in the overall editing intent with unconditional embeddings $\varnothing$, while Stable Diffusion (SD) demonstrates a one-to-one correspondence with text tokens.
  • Figure 4: Visualization and ablation. The first 4 columns show the intermediate results related to the edge smoother. The last column compares the final edited results with and without the edge smoother.
  • Figure 5: Qualitative comparison. We compare the editing efficacy of our ZONE with existing SOTA methods. The instructions (or instructions that are equivalent to the descriptions required by some baselines) used for editing are written below each row of the images.
  • ...and 8 more figures