ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, Yedid Hoshen

TL;DR

ObjectMate introduces the object recurrence prior to build a massive supervised dataset for object composition, enabling simple diffusion models to perform both object insertion and subject-driven generation without test-time tuning. The approach applies instance-retrieval (IR) features to large unlabeled datasets (COCO, Open Images, WebLI) to collect diverse views of the same object, uses an object-removal model to produce the background image that describes the scene, and trains a latent diffusion network to map scene descriptions and object views to composites. It achieves state-of-the-art identity preservation and photorealistic integration, and introduces a supervised benchmark and an IR-feature-based metric for object identity preservation that is validated by user studies. The work suggests that further scaling of data and retrieval quality will continue to improve performance, and that the approach could extend to related editing and 3D tasks.

Abstract

This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to fully meet the task's challenging objectives: (i) seamlessly composing the object into the scene with photorealistic pose and lighting, and (ii) preserving the object's identity. We hypothesize that achieving these goals requires large-scale supervision, but manually collecting sufficient data is too expensive. The key observation in this paper is that many mass-produced objects recur across multiple images of large unlabeled datasets, in different scenes, poses, and lighting conditions. We use this observation to create massive supervision by retrieving sets of diverse views of the same object. This paired dataset enables us to train a straightforward text-to-image diffusion architecture to map the object and scene descriptions to the composited image. We compare our method, ObjectMate, with state-of-the-art methods for object insertion and subject-driven generation, using a single reference or multiple references. Empirically, ObjectMate achieves superior identity preservation and more photorealistic composition. Unlike many other multi-reference methods, ObjectMate does not require slow test-time tuning.

Paper Structure

This paper contains 30 sections, 4 equations, 27 figures, 6 tables.

Figures (27)

  • Figure 1: Our method composes objects into scenes with photorealistic pose and lighting, while preserving their identity. The scene can be specified via an image or text. We do not use test-time tuning.
  • Figure 2: Retrieval feature comparison. Retrieval with DINO features (right) produces semantic matches, while instance retrieval features [first_place] (middle) find identical objects.
  • Figure 3: Object recurrence analysis: (a) Retrieval precision vs. similarity threshold. A threshold of $0.93$ yields $\sim 70\%$ precision. (b) Similarity score distribution for 3 datasets between an object and its 3 nearest neighbors. The legend shows the percentage of objects within the range of $[0.93, 0.975]$. (c) The percentage of objects in this range grows super-linearly as we use larger subsets of WebLI.
  • Figure 4: Recurring mass-produced objects. Percentage of instances within classes of everyday objects with at least 3 retrieved recurrences in WebLI.
  • Figure 5: Creating a supervised dataset. For each unlabeled image, we detect and crop objects with high detection confidence. Next, we extract the kNN of these objects based on IR feature similarity. To generate the background image, we apply an object removal model.
  • ...and 22 more figures
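
The retrieval step summarized in the captions of Figures 3 and 5 can be illustrated with a minimal sketch. It assumes the objects have already been detected, cropped, and embedded by some IR feature extractor, which is not shown. The function name `find_recurrences`, the brute-force cosine-similarity search in NumPy, and the choice to apply the range $[0.93, 0.975]$ as a per-neighbor filter are assumptions made for illustration, not the authors' implementation. Only the values 3 (nearest neighbors), 0.93, and 0.975 come from the captions.

```python
import numpy as np


def find_recurrences(features, k=3, low=0.93, high=0.975):
    """Return, for each object, the indices of its k nearest neighbors
    whose cosine similarity lies inside [low, high].

    `features` is an (n_objects, dim) array of hypothetical IR embeddings.
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = feats @ feats.T

    # An object must not be retrieved as its own neighbor.
    np.fill_diagonal(sims, -np.inf)

    # Indices of the k most similar objects per row, most similar first.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]

    recurrences = []
    for i, neighbors in enumerate(nn_idx):
        # Keep only neighbors inside the similarity band (assumed usage
        # of the range reported in the Figure 3 caption).
        kept = [int(j) for j in neighbors if low <= sims[i, j] <= high]
        recurrences.append(kept)
    return recurrences


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real IR features.
    toy = np.array([
        [1.0, 0.0],
        [0.95, np.sqrt(1 - 0.95 ** 2)],  # cosine 0.95 with the first row
        [1.0, 0.001],                    # nearly identical to the first row
        [0.0, 1.0],                      # unrelated direction
    ])
    print(find_recurrences(toy))
```

In this toy input the second row falls inside the band relative to the first, the near-identical third row exceeds the upper bound and is dropped, and the orthogonal fourth row falls below the lower bound. A brute-force similarity matrix is only practical for small collections; at the scale of the datasets named in the source, an approximate nearest-neighbor index would presumably be needed.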