PICS: Pairwise Image Compositing with Spatial Interactions

Hang Zhou; Xinxin Zuo; Sen Wang; Li Cheng

TL;DR

PICS is a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among fully or partially visible objects and the background.

Abstract

Despite strong single-turn performance, diffusion-based image compositing often struggles to preserve coherent spatial relations in pairwise or sequential edits, where subsequent insertions may overwrite previously generated content and disrupt physical consistency. We introduce PICS, a self-supervised composition-by-decomposition paradigm that composes objects in parallel while explicitly modeling the compositional interactions among (fully-/partially-)visible objects and background. At its core, an Interaction Transformer employs mask-guided Mixture-of-Experts to route background, exclusive, and overlap regions to dedicated experts, with an adaptive α-blending strategy that infers a compatibility-aware fusion of overlapping objects while preserving boundary fidelity. To further enhance robustness to geometric variations, we incorporate geometry-aware augmentations covering both out-of-plane and in-plane pose changes of objects. Our method delivers superior pairwise compositing quality and substantially improved stability, with extensive evaluations across virtual try-on, indoor, and street scene settings showing consistent gains over state-of-the-art baselines. Code and data are available at https://github.com/RyanHangZhou/PICS
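As a rough illustration of the region split and overlap fusion described in the abstract, the sketch below partitions a pixel grid into background, exclusive, and overlap regions from two binary object masks, then α-blends the two objects inside the overlap. It is a toy on scalar pixel values, not the paper's Interaction Transformer: the function names `partition_regions` and `blend_pair` are hypothetical, and `alpha` is supplied by hand here, whereas PICS infers the blending weights.

```python
def partition_regions(mask_a, mask_b):
    """Label each pixel as background, a_only, b_only, or overlap."""
    regions = []
    for row_a, row_b in zip(mask_a, mask_b):
        row = []
        for a, b in zip(row_a, row_b):
            if a and b:
                row.append("overlap")      # both objects claim this pixel
            elif a:
                row.append("a_only")       # exclusive region of object A
            elif b:
                row.append("b_only")       # exclusive region of object B
            else:
                row.append("background")
        regions.append(row)
    return regions


def blend_pair(background, obj_a, obj_b, mask_a, mask_b, alpha):
    """Composite two objects onto a background in one pass.

    alpha[y][x] is the weight of object A inside the overlap region
    (assumed given here; hypothetical stand-in for a learned weight).
    """
    regions = partition_regions(mask_a, mask_b)
    out = []
    for y, row in enumerate(regions):
        out_row = []
        for x, region in enumerate(row):
            if region == "overlap":
                w = alpha[y][x]
                out_row.append(w * obj_a[y][x] + (1.0 - w) * obj_b[y][x])
            elif region == "a_only":
                out_row.append(obj_a[y][x])
            elif region == "b_only":
                out_row.append(obj_b[y][x])
            else:
                out_row.append(background[y][x])
        out.append(out_row)
    return out
```

Because both objects are placed in a single pass, neither insertion overwrites the other outside the overlap, and the overlap is resolved by the per-pixel weight rather than by insertion order.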

Paper Structure (61 sections, 22 equations, 23 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Image compositing.
Multi-turn image editing.
Projected object relations.
Methodology
Pairwise Image Compositing
Exploring two-turn compositing.
Parallel image-prompted compositing.
Interaction Transformer
Feature-space routing masks.
Spatially-aware Mixture-of-Experts.
Region-gated updates and aggregation.
Augmentations
Multi-view shape prior.
...and 46 more sections

Figures (23)

Figure 1: Our method generates spatially plausible and visually realistic pairwise compositions. Each row illustrates two examples, consisting of (from left to right) the objects, the masked background, and two exemplar composite results. Additional comparative results appear in the appendix.
Figure 2: Visual comparison of pairwise support relations across Paint-by-Example, ControlCom, ObjectStitch, AnyDoor, FreeCompose, OmniPaint and InsertAnything. Left: backgrounds and two objects; right: compositing results. The first row shows composites with the basket, and the second row shows subsequent composites obtained by adding the bread on top. Unlike prior methods that suffer from contact artifacts and fidelity loss, our approach performs parallel compositing, effectively handling spatial occlusions and yielding consistent results with preserved fine-grained structure.
Figure 3: Overview of PICS. Input data are constructed by decomposing the target image into a background and pairwise objects with their designated regions. (a) The interaction diffusion network composites the objects into the background. (b) The interaction transformer block, shared across both branches, models interactions among objects and with the background. (c) Expert blocks focus on distinct spatial regions. All notations are defined in the main text for clarity.
Figure 4: Qualitative comparison on the LVIS validation set. Source images, backgrounds, and the two decomposed objects are shown on the left. On the right are the recompositing results from different methods. Our approach is the only one that produces composites with realistic spatial interactions between scene objects while maintaining scene consistency and object identity.
Figure 5: Qualitative comparison of different composition orders on the DreamBooth test set. Left: backgrounds and two objects. Right: results from different methods. Our approach better preserves natural contacts and occlusions, while implicitly learning the correct occlusion order.
...and 18 more figures
