Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar

TL;DR

The paper tackles instruction-based image inpainting: removing objects from images using only textual prompts, with no masks. It introduces Inst-Inpaint, a latent diffusion framework, and GQA-Inpaint, a real-image dataset built from scene graphs to train and evaluate text-guided removal. In comparisons against diffusion-based and GAN-based baselines on real and synthetic data, the approach achieves better realism and removal accuracy, supporting the feasibility and practicality of text-driven object erasure. The work also highlights attention-driven localization and provides a dataset and analysis toolkit to spur further research in instruction-based image editing.

Abstract

The image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove, which can be time-consuming and prone to errors. In this work, we are interested in an image inpainting algorithm that estimates which object to remove based on natural language input and removes it at the same time. For this purpose, first, we construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.

Paper Structure (11 sections, 2 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Instructional Image Inpainting. We propose a new image inpainting method, Inst-Inpaint, which takes an image and a textual instruction as input and automatically removes the objects mentioned in the text, without the user-supplied binary mask that traditional inpainting approaches need to locate the region of interest. We show sample results of our method trained and tested on a new real image inpainting benchmark dataset, GQA-Inpaint, which we created for the proposed instructional image inpainting task.
  • Figure 2: The proposed GQA-Inpaint dataset and our Inst-Inpaint method. Our work begins by generating a dataset for the proposed instructional image inpainting task. To create input/output pairs, we use the images and scene graphs of the GQA dataset (Hudson and Manning, 2019). (a) We first select an object of interest. (b) We perform instance segmentation to locate the object in the image. (c) We apply a state-of-the-art image inpainting method to erase the object. (d) Finally, we create a template-based textual prompt to describe the removal operation. As a result, our GQA-Inpaint dataset includes a total of 147,165 unique images and 41,407 different instructions. Trained on this dataset, our Inst-Inpaint model is a text-based image inpainting method built on a conditioned Latent Diffusion Model (Rombach et al., 2022). It does not require a user-specified binary mask and performs object removal in a single step, without predicting a mask as similar works do.
  • Figure 3: A sample image from the GQA dataset (Hudson and Manning, 2019) and the corresponding scene graph.
  • Figure 4: Distribution of the relation types that exist in the proposed GQA-Inpaint dataset (sorted by number of occurrences).
  • Figure 5: Comparison of mask extraction methods. The vocabulary of the COCO dataset differs from the vocabulary of the GQA dataset. To extract the correct segmentation masks, we combine pretrained models available in Detectron2's Model Zoo and use a vocabulary generated from the scenes to predict the corresponding class labels with the Detic framework. Having several setups gives us more options when picking the most accurate instance segmentation mask.
  • ...and 7 more figures
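
The Figure 2 caption describes selecting an object from a scene graph and writing a template-based removal prompt for it, and Figure 4 reports relation types sorted by occurrence. The Python sketch below illustrates those two ideas only. It is not the authors' code: the `SceneObject` structure, the function names, and the prompt wording ("remove the ...") are assumptions made for illustration, and the real GQA scene-graph format and the paper's templates may differ.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One node of a simplified, hypothetical scene graph."""
    name: str
    # Outgoing edges stored as (relation, target object id) pairs.
    relations: list[tuple[str, str]] = field(default_factory=list)


def build_instructions(graph: dict[str, SceneObject], target_id: str) -> list[str]:
    """Return template-based removal prompts for one selected object.

    The first prompt names only the object; each further prompt adds one
    relation to disambiguate it from other objects of the same class.
    """
    target = graph[target_id]
    prompts = [f"remove the {target.name}"]
    for relation, other_id in target.relations:
        other = graph[other_id]
        prompts.append(f"remove the {target.name} {relation} the {other.name}")
    return prompts


def relation_histogram(graph: dict[str, SceneObject]) -> list[tuple[str, int]]:
    """Count relation types over the whole graph, most frequent first."""
    counts = Counter(
        relation for obj in graph.values() for relation, _ in obj.relations
    )
    return counts.most_common()


# Toy scene graph (invented for this example, not taken from GQA).
graph = {
    "1": SceneObject("cup", [("on", "2"), ("to the left of", "3")]),
    "2": SceneObject("table"),
    "3": SceneObject("laptop", [("on", "2")]),
}

instructions = build_instructions(graph, "1")
histogram = relation_histogram(graph)
```

In this toy graph, selecting the cup yields one plain prompt and one prompt per relation, which is one plausible way a single image can contribute several distinct instructions. Steps (b) and (c) of the caption, instance segmentation and inpainting, depend on pretrained models and are left out of the sketch.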