Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, Rongrong Ji

TL;DR

Diffree addresses the challenge of text-guided object addition without requiring user-drawn masks, aiming to preserve background fidelity during insertion. It introduces OABench, a 74K synthetic dataset created by removing objects from real images to train and evaluate a diffusion-based model augmented with an Object Mask Predictor. The Diffree architecture couples a pre-trained Stable Diffusion backbone with an OMP module and classifier-free guidance, enabling text-driven mask and object generation within the input image. Across COCO and OpenImages, Diffree achieves high success rates and superior background consistency, while also supporting iterative insertions and integration with planning systems like GPT4V and tools like AnyDoor. Overall, the approach advances practical text-guided image editing by delivering reliable, mask-free object insertion with strong contextual alignment and versatile applications.

Abstract

This paper addresses an important problem of object addition for images with only text guidance. It is challenging because the new object must be integrated seamlessly into the image with consistent visual context, such as lighting, texture, and spatial location. While existing text-guided image inpainting methods can add objects, they either fail to preserve the background consistency or involve cumbersome human intervention in specifying bounding boxes or user-scribbled masks. To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control. To this end, we curate OABench, an exquisite synthetic dataset by removing objects with advanced image inpainting techniques. OABench comprises 74K real-world tuples of an original image, an inpainted image with the object removed, an object mask, and object descriptions. Trained on OABench using the Stable Diffusion model with an additional mask prediction module, Diffree uniquely predicts the position of the new object and achieves object addition with guidance from only text. Extensive experiments demonstrate that Diffree excels in adding new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.
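The four-part tuple described above can be pictured as a small record type. The sketch below is illustrative only: the names `ObjectAdditionSample` and `composite` are hypothetical, images are modeled as nested lists of RGB tuples to keep the example dependency-free, and the mask-based blend is one common way to leave background pixels untouched, assumed here for illustration rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # one RGB pixel
Image = List[List[Pixel]]      # rows of pixels
Mask = List[List[int]]         # 1 where the object is, 0 elsewhere


@dataclass
class ObjectAdditionSample:
    """Hypothetical container mirroring the tuple described in the abstract."""
    original: Image      # real image that still contains the object
    inpainted: Image     # same image with the object removed
    mask: Mask           # region the object occupies
    description: str     # text description of the object


def composite(background: Image, generated: Image, mask: Mask) -> Image:
    """Take generated pixels inside the mask and background pixels outside it.

    Assumption: a hard (binary) mask; a soft mask would blend the two instead.
    """
    return [
        [gen_px if m else bg_px
         for bg_px, gen_px, m in zip(bg_row, gen_row, mask_row)]
        for bg_row, gen_row, mask_row in zip(background, generated, mask)
    ]
```

Under this reading, a model would take `inpainted` and `description` as input and be supervised with `original` and `mask`; the blend guarantees that every pixel outside the mask is copied from the input unchanged.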

Paper Structure

This paper contains 34 sections, 9 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Our approach iteratively generates inpainting results. Objects specified by the text guidance are added to the images in reasonable locations while preserving background consistency.
  • Figure 2: Qualitative comparisons of Diffree and different kinds of methods.
  • Figure 3: Diffree adds objects to the same image, with different spatial relationships.
  • Figure 4: Diffree iteratively generates results. Objects added later can relate to those added earlier.
  • Figure 5: The data collection process of OABench.
  • ...and 6 more figures