Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar
TL;DR
The paper tackles instruction-based image inpainting: removing objects specified solely by text prompts, without masks. It introduces Inst-Inpaint, a latent diffusion framework, and GQA-Inpaint, a real-image dataset built from scene graphs to train and evaluate text-guided removal. In extensive comparisons against diffusion-based and GAN-based baselines on real and synthetic data, the approach achieves superior realism and removal accuracy, supporting the feasibility and practicality of text-driven object erasure. The work also highlights attention-driven localization and provides a dataset and analysis toolkit to spur further research in instruction-based image editing.
Abstract
The image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are specified with binary masks. From an application point of view, a user needs to generate masks for the objects they would like to remove, which can be time-consuming and error-prone. In this work, we are interested in an image inpainting algorithm that estimates which object to remove based on natural language input and removes it simultaneously. For this purpose, we first construct a dataset for this task, named GQA-Inpaint. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on instructions given as text prompts. We establish various GAN-based and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods using different evaluation metrics that measure the quality and accuracy of the models, and show significant quantitative and qualitative improvements.
