CareCom: Generative Image Composition with Calibrated Reference Features

Jiaxuan Chen, Bo Zhang, Qingdong He, Jinlong Peng, Li Niu

TL;DR

CareCom addresses generative image composition with multiple foreground references by calibrating the reference features to fit the background. It introduces global and local reference feature calibration modules whose outputs augment the original reference features and are injected into a denoising diffusion network, enabling simultaneous detail preservation and pose/view adjustment. The method is pretrained on MVImgNet and finetuned with few-shot exemplars, and it outperforms baselines in background fidelity, pose compatibility, and overall image quality on MVImgNet and MureCom. The results indicate that calibrated, multi-reference features can substantially improve realism and fidelity in foreground insertion, with practical implications for flexible image editing and content creation.

Abstract

Image composition aims to seamlessly insert a foreground object into a background. Despite the huge progress in generative image composition, existing methods still struggle to preserve details and adjust the foreground pose/view at the same time. To address this issue, we extend the existing generative composition model to a multi-reference version, which allows using an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of the foreground reference images to make them compatible with the background information. The calibrated reference features can supplement the original reference features with useful global and local information of proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model can greatly benefit from the calibrated reference features.
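The idea that calibrated features supplement, rather than replace, the original reference features can be illustrated with toy vectors. This is only a sketch under stated assumptions: the names `calibrate` and `augment`, the two-dimensional features, and the simple blend toward a background vector are all hypothetical stand-ins, since the calibration modules in the paper are learned networks that this code does not reproduce.

```python
from typing import List

Vector = List[float]


def calibrate(ref: Vector, background: Vector, strength: float = 0.5) -> Vector:
    """Hypothetical stand-in for a learned calibration module.

    Blends a reference feature toward a background feature. The real
    modules are trained networks; this blend is an assumption made only
    to keep the example runnable.
    """
    return [(1.0 - strength) * r + strength * b for r, b in zip(ref, background)]


def augment(refs: List[Vector], background: Vector) -> List[Vector]:
    """Return the original features followed by their calibrated versions.

    The originals are kept so that the calibrated features add to them
    instead of replacing them. Works for any number of references.
    """
    calibrated = [calibrate(r, background) for r in refs]
    return refs + calibrated


# Three toy reference features and one toy background feature (assumed values).
refs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
background = [0.5, 0.5]

tokens = augment(refs, background)
print(len(tokens))
```

With three references, the augmented set holds six feature vectors: the three originals and their three calibrated counterparts. In a real pipeline these would be the conditioning inputs passed to the denoising network.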

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of the feature calibration process. Based on the background and the foreground bounding box, we calibrate the features of the foreground reference images to match those of the ground-truth foreground.
  • Figure 2: (a) Given multiple foreground reference images, we extract their global/local features $\mathcal{F}^g$/$\mathcal{F}^l$, which are passed through the calibration module $C^g$/$C^l$. The calibrated features $\tilde{\mathcal{F}}^g$/$\tilde{\mathcal{F}}^l$ are injected into the decoder of the denoising UNet. (b) Illustration of finding the spatial correspondence of local patches between the foreground reference $\bm{I}_k^f$ and the ground-truth foreground $\hat{\bm{I}}^f$. (c) The structure of the calibration module $C^g$/$C^l$.
  • Figure 3: Visual comparison of different methods on the MureCom dataset. From left to right, we show the background, 5 reference images, and the results of AnyDoor [chen2024anydoor], ControlCom [zhang2023controlcom], ObjectStitch [Song_2023_CVPR], Insert Anything [song2025insert], UniCombine [wang2025unicombine], and our CareCom.
  • Figure 4: Ablation study of our GRFC/LRFC modules and uncalibrated features (UCF). From left to right, we show background image, five reference images, the results of ObjectStitch, three ablated versions of our method, and our full method. The red boxes indicate the changed details.
  • Figure 5: The distance between the uncalibrated/calibrated reference features and the ground-truth reference features across denoising steps.
  • ...and 1 more figure