Prompt Augmentation for Self-supervised Text-guided Image Manipulation

Rumeysa Bodur, Binod Bhattarai, Tae-Kyun Kim

TL;DR

The paper tackles localized text-guided image manipulation by introducing prompt augmentation to generate multiple target prompts and on-the-fly masks from diffusion noise differences. It offers a self-supervised learning framework using a Contrastive Loss to push edited regions away from the source while keeping unedited regions aligned, and a Soft Contrastive Loss that incorporates prompt similarity via CLIP embeddings. The method demonstrates competitive quantitative metrics and strong qualitative results without requiring masks at inference or paired data, validated through ablations and hyper-parameter studies. This approach advances flexible, context-preserving editing across diverse prompts and images, with practical implications for controllable image manipulation workflows.
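The idea of deriving a mask from differences between diffusion noise estimates can be illustrated with a toy sketch. This is not the authors' implementation: the function name `difference_mask`, the averaging over target prompts, the peak normalisation, and the `threshold=0.5` default are all assumptions made for illustration, and small nested lists stand in for real noise tensors.

```python
def difference_mask(source_noise, target_noises, threshold=0.5):
    """Binary mask marking pixels where noise estimates disagree.

    source_noise: 2D list, noise estimate under the source prompt.
    target_noises: list of 2D lists, one per augmented target prompt.
    threshold: hypothetical cut-off applied after peak normalisation.
    """
    height, width = len(source_noise), len(source_noise[0])
    # Mean absolute difference to the source, averaged over target prompts.
    diff = [
        [
            sum(abs(t[r][c] - source_noise[r][c]) for t in target_noises)
            / len(target_noises)
            for c in range(width)
        ]
        for r in range(height)
    ]
    peak = max(max(row) for row in diff)
    if peak == 0:
        # All estimates agree: nothing is marked for editing.
        return [[0] * width for _ in range(height)]
    # Normalise by the peak difference and binarise.
    return [[1 if v / peak >= threshold else 0 for v in row] for row in diff]


source = [[0.0, 0.0], [0.0, 0.0]]
targets = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.2], [0.0, 1.0]],
]
mask = difference_mask(source, targets)
```

In this toy input, only the bottom-right pixel differs strongly across target prompts, so it is the only one marked as part of the edit region.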

Abstract

Text-guided image editing finds applications in various creative and practical fields. While recent studies in image generation have advanced the field, they often struggle with the dual challenges of coherent image transformation and context preservation. In response, our work introduces prompt augmentation, a method amplifying a single input prompt into several target prompts, strengthening textual context and enabling localised image editing. Specifically, we use the augmented prompts to delineate the intended manipulation area. We propose a Contrastive Loss tailored to driving effective image editing by displacing edited areas and drawing preserved regions closer. Acknowledging the continuous nature of image manipulations, we further refine our approach by incorporating the similarity concept, creating a Soft Contrastive Loss. The new losses are incorporated into the diffusion model, demonstrating improved or competitive image editing results on public datasets and generated images over state-of-the-art approaches.
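The behaviour the abstract attributes to the two losses (preserved regions drawn closer, edited regions displaced, with the displacement softened by prompt similarity) can be sketched on toy data. The paper's actual equations are not reproduced in this summary, so the formulation below is an assumption: a squared-distance term outside the mask, a squared hinge with a hypothetical `margin` inside the mask, and a margin scaled by `1 - prompt_similarity`. Flat lists of numbers stand in for image features, and short vectors stand in for CLIP prompt embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (toy CLIP stand-ins)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_contrastive_loss(source, edited, mask, prompt_similarity=0.0, margin=1.0):
    """Illustrative loss: pull unmasked features together, push masked ones apart.

    With prompt_similarity=0.0 this reduces to a plain (hard) contrastive loss.
    A higher prompt similarity shrinks the margin, so similar prompts demand
    a smaller displacement of the edited region (assumed behaviour).
    """
    keep_terms, edit_terms = [], []
    soft_margin = margin * (1.0 - prompt_similarity)
    for s, e, m in zip(source, edited, mask):
        dist = abs(s - e)
        if m:
            # Edited region: penalise features that stay within the margin.
            edit_terms.append(max(0.0, soft_margin - dist) ** 2)
        else:
            # Preserved region: penalise any drift from the source.
            keep_terms.append(dist ** 2)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(keep_terms) + mean(edit_terms)


source_feats = [0.0, 0.0, 0.0, 0.0]
mask = [0, 0, 1, 1]  # last two positions belong to the edit region

good_edit = soft_contrastive_loss(source_feats, [0.0, 0.0, 2.0, 2.0], mask)
no_edit = soft_contrastive_loss(source_feats, source_feats, mask)
soft_no_edit = soft_contrastive_loss(
    source_feats, source_feats, mask, prompt_similarity=0.5
)
```

An edit that changes only the masked positions by more than the margin incurs no loss, while leaving the masked positions untouched is penalised. The penalty is smaller when the source and target prompts are similar.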

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Text-guided image manipulation. Illustrative examples generated by our method (bottom row) with localised manipulations based on given text prompts and input images (top row).
  • Figure 2: Overview of our method. (a) Prompt Augmentation: To augment the prompts for localised image editing, we first refine the textual descriptions of the source images using the BLIP captioning model [li2022blip], which yields cleaner captions suitable for further processing. We then augment this input prompt by generating a range of target prompts using masked language modeling and word relations. (b) Soft Contrastive Loss (Soft-CL): The augmented prompts are used to compute an attention mask from the differences between the generated images. This mask is used to bring the inverse-masked areas of the generated images closer while pushing the masked areas apart, taking their similarity to the prompts into account.
  • Figure 3: Qualitative comparison of our method against SDEdit [meng2022sdedit], DALL-E 2 [ramesh2022dalle], DiffEdit [diffedit] and InstructPix2Pix [brooks2022instructpix2pix] using both generated and real images.
  • Figure 4: Qualitative comparison of ablation study results.