Hands-off Image Editing: Language-guided Editing without any Task-specific Labeling, Masking or even Training
Rodrigo Santos, António Branco, João Silva, João Rodrigues
TL;DR
This work tackles instruction-guided image editing without task-specific labeling, masking, or training by combining an LLM-driven captioning step with diffusion-based image generation. It builds an edit direction from CLIP embeddings of before- and after-edit captions, which guides image inversion and reconstruction via a deconstructed input, all in a fully inference-based pipeline. On MAGICBRUSH, the method achieves competitive, sometimes superior, performance versus supervised approaches while avoiding their data-labeling bottleneck; Mistral often yields the best results, and Stable Diffusion 1.4 provides a favorable balance. The study also reports gains from prompt simplification and meta-prompting, which suggests that scalable, language-guided editing will become more practical as foundation models continue to evolve.
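To make the edit-direction idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the direction is simply the difference between the after-edit and before-edit caption embeddings. The function `toy_embed` is a hypothetical, deterministic stand-in for a CLIP text encoder so that the example runs without model weights; the names `edit_direction`, `before`, and `after` are likewise illustrative.

```python
import hashlib

import numpy as np


def toy_embed(caption: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a CLIP text encoder (assumption, not real CLIP).

    Maps a caption to a deterministic unit vector by seeding a random
    generator with a hash of the caption text.
    """
    digest = hashlib.sha256(caption.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)


def edit_direction(before_caption: str, after_caption: str, embed=toy_embed) -> np.ndarray:
    """Assumed form of the edit direction: embed(after) - embed(before)."""
    return embed(after_caption) - embed(before_caption)


# Illustrative captions: one describing the input image, one describing the edit result.
before = "a dog sitting on a sofa"
after = "a cat sitting on a sofa"
direction = edit_direction(before, after)
```

In a real pipeline, `embed` would be replaced by an actual CLIP text encoder, and the resulting direction would be applied during diffusion-based reconstruction; how that is done in this work is not specified in this section.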
Abstract
Instruction-guided image editing consists of taking an image and an instruction and delivering that image altered according to that instruction. State-of-the-art approaches to this task suffer from the typical scaling-up and domain-adaptation hindrances related to supervision, as they eventually resort to some kind of task-specific labeling, masking or training. We propose a novel approach that does without any such task-specific supervision and thus offers better potential for improvement. Its assessment demonstrates that it is highly effective, achieving very competitive performance.
