Text-Driven Image Editing via Learnable Regions

Yuanze Lin; Yi-Wen Chen; Yi-Hsuan Tsai; Lu Jiang; Ming-Hsuan Yang

TL;DR

The paper tackles mask-free, text-driven local image editing by learning edit regions as bounding boxes guided by language prompts. It introduces a region generation network that selects among bounding-box proposals placed around anchor points derived from self-attention maps, and it integrates with pre-trained editors such as MaskGIT and Stable Diffusion. Training uses a CLIP-based objective, $L = \lambda_C L_{Clip} + \lambda_S L_{Str} + \lambda_D L_{Dir}$, and inference ranks candidate edits with the score $S = \alpha S_{t2i} + \beta S_{i2i}$. The terms of both $L$ and $S$ are computed in CLIP space to align visual edits with the textual description and to preserve source content where appropriate. The approach achieves high-fidelity edits that respect complex prompts, demonstrated through qualitative results and a user study where it outperformed several state-of-the-art baselines. Because it needs no masks and is compatible with multiple editing models, the method offers a practical route to language-guided image manipulation.
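To make the two weighted sums in the TL;DR concrete, here is a minimal Python sketch. It only illustrates how the terms combine; it is not the authors' implementation. The weight values, the `LossWeights` container, and the helper names are hypothetical, and the reading of $S_{t2i}$ as a text-to-image similarity and $S_{i2i}$ as an image-to-image similarity is an assumption based on the subscripts. The individual CLIP-based terms are taken as already-computed numbers.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Placeholder values; the source does not state the actual weights.
    clip: float = 1.0       # lambda_C
    structure: float = 1.0  # lambda_S
    direction: float = 1.0  # lambda_D


def total_loss(l_clip, l_str, l_dir, w):
    """L = lambda_C * L_Clip + lambda_S * L_Str + lambda_D * L_Dir."""
    return w.clip * l_clip + w.structure * l_str + w.direction * l_dir


def ranking_score(s_t2i, s_i2i, alpha, beta):
    """S = alpha * S_t2i + beta * S_i2i."""
    return alpha * s_t2i + beta * s_i2i


def pick_best(candidates, alpha, beta):
    """Return the index of the candidate edit with the highest score S.

    `candidates` is a list of (s_t2i, s_i2i) pairs, one per edited image.
    """
    return max(
        range(len(candidates)),
        key=lambda i: ranking_score(candidates[i][0], candidates[i][1], alpha, beta),
    )
```

For example, `pick_best([(0.2, 0.9), (0.3, 0.5)], alpha=2.0, beta=1.0)` compares the scores 1.3 and 1.1 and selects the first candidate. Raising `alpha` shifts the preference toward edits that match the text more strongly.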

Abstract

Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pre-trained text-to-image model and introduces a bounding box generator to identify the editing regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences, or lengthy paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. The experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions. Our project webpage can be found at: https://yuanze-lin.me/LearnableRegions_page.

Paper Structure (13 sections, 6 equations, 6 figures, 1 table)


Introduction
Related Work
Text-to-Image Synthesis.
Text-driven Image Manipulation.
Proposed Method
Edit-Region Generation
Training Objectives
Inference
Compatibility with Pretrained Editing Models
Experimental Results
Implementation Details.
Qualitative Evaluation
Comparisons with Prior Work

Figures (6)

Figure 1: Overview. Given an input image and a language description for editing, our method can generate realistic and relevant images without the need for user-specified regions for editing. It performs local image editing while preserving the image context.
Figure 2: Effects of variations in editing regions on generated image quality. Region 1 and Region 2 are two prior regions drawn from the self-attention map of DINO (Caron et al., 2021). Region (ours), shown in the second-to-last column, represents the regions produced by our model, which have the best overall quality.
Figure 3: Framework of the proposed method. We first feed the input image into the self-supervised learning (SSL) model, e.g., DINO (Caron et al., 2021), to obtain the attention map and feature, which are used for anchor initialization. The region generation model initializes several region proposals (e.g., 3 proposals in this figure) around each anchor point, and learns to select the most suitable ones among them with the region generation network (RGN). The predicted region and the text descriptions are then fed into a pre-trained text-to-image model for image editing. We utilize the CLIP model for learning the score to measure the similarity between the given text description and the edited result, forming a training signal to learn our region generation model.
Figure 4: Image editing results with simple and complex prompts. Given the input images and prompts, our method edits the image without requiring masks from the users. The learned region is omitted for better visualization. The first row contains diverse prompts for one kind of object. The second row displays prompts featuring multiple objects. The third row shows prompts with geometric relations, and the last row presents prompts with extended length.
Figure 5: Comparison with existing methods. We compare our method with existing text-driven image editing methods. From left to right: Input image, Plug-and-Play (Tumanyan et al., 2023), InstructPix2Pix (Brooks et al., 2023), Null-text (Mokady et al., 2023), DiffEdit (Couairon et al., 2022), MasaCtrl (Cao et al., 2023), and ours.
...and 1 more figure
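
The anchor-and-proposal step described in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's region generation network. The attention map is a plain 2D list, anchors are taken as the highest-attention cells with no suppression of neighboring peaks, the proposal sizes are arbitrary, and `select_region` uses a hard argmax over externally supplied scores where the actual model learns the selection. All function names are hypothetical.

```python
def top_k_anchors(attn, k):
    """Return (row, col) positions of the k largest attention values."""
    cells = [
        (attn[r][c], r, c)
        for r in range(len(attn))
        for c in range(len(attn[0]))
    ]
    # Sort by descending attention; the stable sort keeps row-major order on ties.
    cells.sort(key=lambda t: -t[0])
    return [(r, c) for _, r, c in cells[:k]]


def proposals_around(anchor, sizes, height, width):
    """Build one box per (h, w) size, roughly centered on the anchor.

    Boxes are (top, left, bottom, right) with exclusive bottom/right,
    clipped to the image bounds.
    """
    r, c = anchor
    boxes = []
    for h, w in sizes:
        top = max(0, r - h // 2)
        left = max(0, c - w // 2)
        bottom = min(height, top + h)
        right = min(width, left + w)
        boxes.append((top, left, bottom, right))
    return boxes


def select_region(boxes, scores):
    """Pick the proposal with the highest score (stand-in for the learned RGN)."""
    best = max(range(len(boxes)), key=lambda i: scores[i])
    return boxes[best]
```

In a full pipeline the scores would come from the trained network, and the chosen box would be passed with the text prompt to the pre-trained editing model; here they are plain inputs so the control flow stays visible.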
