Hands-off Image Editing: Language-guided Editing without any Task-specific Labeling, Masking or even Training

Rodrigo Santos, António Branco, João Silva, João Rodrigues

TL;DR

This work tackles instruction-guided image editing without task-specific labeling, masking, or training by integrating an LLM-driven captioning step with diffusion-based image generation. It builds an edit-direction from CLIP embeddings of before- and after-edit captions, guiding image inversion and reconstruction via a deconstructed input, all in a fully inference-based pipeline. On MAGICBRUSH, the method achieves competitive, sometimes superior, performance versus supervised approaches while avoiding the data-labeling bottlenecks, with Mistral often yielding the best results and Stable Diffusion 1.4 providing a favorable balance. The study also demonstrates gains from prompt simplification and meta-prompting, underscoring the potential for further progress as foundational models continue to evolve, making scalable, language-guided editing increasingly practical.
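The edit-direction idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the direction is simply the difference between the after-edit and before-edit caption embeddings, scaled by a weight, and it uses small toy vectors in place of real CLIP text embeddings. The function names `edit_direction` and `apply_direction` are hypothetical.

```python
import numpy as np

def edit_direction(before_emb, after_emb):
    """Assumed form of the edit direction: after-edit minus before-edit embedding."""
    return np.asarray(after_emb, dtype=float) - np.asarray(before_emb, dtype=float)

def apply_direction(cond_emb, direction, weight=1.0):
    """Shift a conditioning embedding along the edit direction by `weight`."""
    return np.asarray(cond_emb, dtype=float) + weight * direction

# Toy stand-ins for caption embeddings (real CLIP embeddings are much larger).
before = np.array([0.2, 0.5, 0.1, 0.0])  # e.g. caption of the original image
after = np.array([0.2, 0.1, 0.6, 0.0])   # e.g. caption of the edited image

direction = edit_direction(before, after)
shifted = apply_direction(before, direction, weight=1.0)
unchanged = apply_direction(before, direction, weight=0.0)
```

Under this assumption, a weight of 0 leaves the conditioning untouched and a weight of 1 moves it fully to the after-edit caption embedding; intermediate or larger weights would correspond to weaker or stronger edits.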

Abstract

Instruction-guided image editing consists of taking an image and an instruction and delivering that image altered according to that instruction. State-of-the-art approaches to this task suffer from the typical scaling-up and domain-adaptation hindrances related to supervision, as they eventually resort to some kind of task-specific labelling, masking or training. We propose a novel approach that does without any such task-specific supervision and thus offers a better potential for improvement. Its assessment demonstrates that it is highly effective, achieving very competitive performance.

Paper Structure

This paper contains 43 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Top: Architecture of the method presented in this paper. Bottom: Three examples, showing inputs (left images and edit requests) from the MAGICBRUSH test set and outputs (right images) from our proposed method.
  • Figure 2: Example from MAGICBRUSH test set. The request is "Make the teddy bear black". The four images are: the original one, the one generated from the noise obtained through DDIM Inversion, the one generated by our system, and the gold edited one in the dataset.
  • Figure 3: Examples of different edit-direction weights, with base images and instructions from MAGICBRUSH. Instruction in top row: "Make the woman obese."; in bottom row: "Let's add birds to the sky".
  • Figure 4: Example of the output generated by the language model.
  • Figure 5: Examples from MAGICBRUSH test set (first columns) edited with our method (second columns), InstructPix2Pix (third columns), and ZONE (fourth columns). Edit requests, left side: "Have the cow wear a hat."; "Change the blue and yellow to red and white plane."; "Make the man look to the camera."; "Put a clown face on the mirror.". Edit requests, right side: "It should have french fries on the plate."; "Add a spider next to the blender."; "Add fire to the buildings."; "Put an exotic planet in the sky.". These images were obtained through 100 DDIM inversion steps, 100 DDIM image generation steps and with captions generated with Mistral.
  • ...and 2 more figures