Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions

Chenrui Ma, Xi Xiao, Tianyang Wang, Yanning Shen

TL;DR

The paper tackles instruction-driven image editing without relying on editing-pair datasets by introducing multi-scale learnable regions that localize edits under text guidance. Its pipeline generates target descriptions with multimodal and language models, fuses image and instruction features with CLIP, and conditions a pre-trained diffusion generator on learned region masks to produce edits. The approach achieves state-of-the-art results across benchmarks without editing pairs, is compatible with diverse generative backbones, and scales with abundant text-image data, offering a data-efficient path to fine-grained visual editing. In practice, this supports precise, natural-language-guided image editing across a wide range of models and tasks.

Abstract

Current text-driven image editing methods typically follow one of two directions: relying on large-scale, high-quality editing pair datasets to improve editing precision and diversity, or exploring alternative dataset-free techniques. However, constructing large-scale editing datasets requires carefully designed pipelines, is time-consuming, and often results in unrealistic samples or unwanted artifacts. Meanwhile, dataset-free methods may suffer from limited instruction comprehension and restricted editing capabilities. To address these challenges, the present work develops a novel paradigm for instruction-driven image editing that leverages the vast supply of widely available text-image pairs instead of relying on editing pair datasets. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing. Extensive experiments demonstrate that the proposed approach attains state-of-the-art performance across various tasks and benchmarks, while exhibiting strong adaptability to various types of generative models.
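
As a rough illustration of the two ideas named in the abstract (an editing region that localizes the change, and image-text alignment used as supervision), the following is a minimal sketch under stated assumptions. It is not the authors' implementation: the square `box_mask`, the per-pixel `composite` blend, and the `1 - cosine similarity` form of `alignment_loss` are hypothetical stand-ins, and the short hand-written vectors stand in for CLIP embeddings, which this summary does not specify.

```python
import math


def box_mask(height, width, top, left, size):
    """Binary mask with a size x size square set to 1 (assumed region shape)."""
    return [
        [1.0 if top <= r < top + size and left <= c < left + size else 0.0
         for c in range(width)]
        for r in range(height)
    ]


def composite(original, generated, mask):
    """Per-pixel blend: mask = 1 takes the generated value, mask = 0 keeps the original."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(image_embedding, text_embedding):
    """Assumed loss form: lower when the image embedding matches the text embedding."""
    return 1.0 - cosine_similarity(image_embedding, text_embedding)


# Toy 4x4 "images": the original is all zeros, the generator output is all ones.
original = [[0.0] * 4 for _ in range(4)]
generated = [[1.0] * 4 for _ in range(4)]

# Two region scales: a small local edit and a whole-image edit.
small_edit = composite(original, generated, box_mask(4, 4, top=1, left=1, size=2))
full_edit = composite(original, generated, box_mask(4, 4, top=0, left=0, size=4))

# Hand-made stand-ins for the embeddings of an edited image and a target description.
matched = alignment_loss([1.0, 0.0], [1.0, 0.0])
mismatched = alignment_loss([1.0, 0.0], [0.0, 1.0])
```

In this toy setup, the small mask changes only the four centre pixels while the large mask replaces the whole image, which is the sense in which a region at different scales can serve both local and global edits. In a trainable version the mask parameters would be optimized against the alignment loss rather than fixed by hand.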

Paper Structure

This paper contains 25 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Framework of the proposed method, comprising description text generation, editing feature semantic alignment, learnable edit region prediction, edited image generation, and CLIP-supervised loss calculation. Icons in the figure distinguish components whose parameters remain fixed from components whose parameters are trained. During inference, the components in the gray area are removed. Stable Diffusion [rombach2022high] serves as the generative model here; however, other text-to-image generators can be used (see the implementation details for more information).
  • Figure 2: Multi-Scale Learnable Region. The learnable region adapts to multi-scale editing requirements from different types of editing operations and varying sizes of target objects.
  • Figure 3: Illustration of user preferences for edited results.
  • Figure 4: Comparison of editing results produced by different methods.
  • Figure 5: Editing results produced by our method using different generative models. For each example, from left to right: original image, FLUX [flux2024], VAR [tian2024visual], and MaskGIT [chang2022maskgit].
  • ...and 2 more figures