BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

Hang Zhou, Xinxin Zuo, Rui Ma, Li Cheng

TL;DR

BootPlace reframes object placement as a placement-by-detection problem: a DETR-style transformer detects regions of interest on object-subtracted backgrounds, and a differentiable, probabilistic object-to-region network then associates each target object with an appropriate region. A bootstrapped training strategy expands data diversity by recombining intact objects into the object-subtracted images, enabling robust multi-object placement. The association matrix is defined over $T$ object queries and $N$ ROIs, forming an $\mathbb{R}^{T \times N}$ representation used to predict placements. On Cityscapes and OPA, BootPlace achieves state-of-the-art object repositioning and placement results, with ablations confirming the importance of ROI detection, semantic complementarity scoring, and multi-object supervision for improved realism and generalization.
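The $T \times N$ object-to-region matrix described above can be illustrated with a small sketch. This is not the paper's association network: the cosine-similarity score, the row-wise softmax, the `temperature` parameter, and the names `associate`, `object_embs`, and `region_embs` are all assumptions chosen only to show how a probabilistic matrix over $T$ objects and $N$ regions yields one best-matched region per object.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def associate(object_embs, region_embs, temperature=0.1):
    """Build a T x N association matrix and pick the best region per object.

    Assumption: similarity is cosine-based and each row is normalized with
    a softmax, so every row is a probability distribution over regions.
    """
    matrix = []
    for obj in object_embs:
        scores = [cosine(obj, reg) / temperature for reg in region_embs]
        matrix.append(softmax(scores))
    best = [max(range(len(row)), key=row.__getitem__) for row in matrix]
    return matrix, best


# Hypothetical embeddings: T = 2 objects, N = 3 detected regions.
object_embs = [[1.0, 0.0], [0.0, 1.0]]
region_embs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
matrix, best = associate(object_embs, region_embs)
```

Each row of `matrix` sums to one, and `best[t]` is the index of the region with the strongest link to object `t`, which corresponds to the bold arrow in the paper's overview figure.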

Abstract

In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance on dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervision. It then semantically associates each target compositing object with detected regions based on their complementary characteristics. Through a bootstrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, markedly surpassing state-of-the-art baselines on the Cityscapes and OPA datasets with notable improvements in IoU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.
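For readers unfamiliar with the IoU metric mentioned above, the following is a minimal sketch of intersection-over-union between a predicted and a ground-truth bounding box. The `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions for illustration; the paper's exact evaluation protocol is not given in this section.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Guard against degenerate zero-area boxes.
    return inter / union if union > 0 else 0.0


# Hypothetical predicted placement versus ground-truth location.
predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
score = box_iou(predicted, ground_truth)
```

A score of 1 means the predicted placement coincides exactly with the ground-truth box, and 0 means the two boxes do not overlap.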

Paper Structure

This paper contains 33 sections, 7 equations, 23 figures, 6 tables.

Figures (23)

  • Figure 1: Our approach, BootPlace, detects regions of interest (represented as bounding boxes) for object composition and assigns each target object to its best-matched detected region. Each object is connected to each detected region with weighted connections, with the bold arrow indicating the strongest link.
  • Figure 2: Network inference. Given a target image, several object queries (e.g., two cars and a pedestrian) and scene object locations, BootPlace detects a set of candidate regions of interest and associates each object with its best-fitting region; these associations are used to produce the composite image. $\otimes$ is feature concatenation and $\bigtriangledown$ is region-wise product.
  • Figure 3: Network architecture and training. We prepare training data by first decomposing a source image into a randomly object-subtracted image $\mathcal{I}$ and a set of object queries. During training, image $\mathcal{I}$ and scene object locations are both fed into a detection transformer for region-of-interest detection. The object queries are fed into an association network for object-to-region matching, where the generated association links each object query with a detected region of interest. The loss comprises a detection loss and an association loss. At a high level, we visualize the relations among object queries, detected regions of interest and ground-truth locations on the right side, where the best-matched association arrow is highlighted in bold.
  • Figure 4: Qualitative results of object repositioning on the Cityscapes dataset. Zoom in to see visual details. 1st column: original images with highlighted compositing objects; 2nd column: inpainted images after object subtraction.
  • Figure 5: Qualitative results of object repositioning on the OPA dataset. "Positive composites" are annotated good-quality composites from OPA. SAC-GAN is excluded as it requires semantic maps for training.
  • ...and 18 more figures