
ControlCom: Controllable Image Composition using Diffusion Model

Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, Li Niu

TL;DR

ControlCom is a controllable diffusion-based framework that unifies image blending, image harmonization, view synthesis, and generative composition in a single model. It introduces a 2-dimensional indicator vector that selects whether the foreground illumination and pose are modified, a two-stage global-local fusion of foreground features extracted by a foreground encoder, and a local enhancement module that preserves foreground fidelity. A self-supervised training pipeline derives supervision for all four tasks from large-scale image collections, enabling end-to-end learning without manual labels. Experiments on COCOEE and the real-world FOSCom dataset show improved controllability and foreground fidelity over baselines; the authors state that the code and model will be released.

Abstract

Image composition aims to synthesize a realistic composite image from a pair of foreground and background images. Recently, generative composition methods have been built on large pretrained diffusion models, given their great potential in image generation. However, they suffer from a lack of controllability over foreground attributes and poor preservation of foreground identity. To address these challenges, we propose a controllable image composition method that unifies four tasks in one diffusion model: image blending, image harmonization, view synthesis, and generative composition. We also design a self-supervised training framework coupled with a tailored pipeline for training data preparation. Moreover, we propose a local enhancement module to enhance the foreground details in the diffusion model, improving the foreground fidelity of composite images. The proposed method is evaluated on both a public benchmark and real-world data, and the results demonstrate that it generates more faithful and controllable composite images than existing approaches. The code and model will be available at https://github.com/bcmi/ControlCom-Image-Composition.
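
To make the role of the 2-dim indicator concrete, the following is a minimal sketch of how such a vector could select one of the four unified tasks. It assumes the indicator is a pair of binary flags ordered as (adjust illumination, adjust pose); that ordering, and the names `Task`, `INDICATOR_TO_TASK`, and `task_from_indicator`, are hypothetical and chosen for illustration only, not taken from the released code.

```python
from enum import Enum
from typing import Dict, Tuple


class Task(Enum):
    """The four composition tasks named in the abstract."""
    BLENDING = "image blending"
    HARMONIZATION = "image harmonization"
    VIEW_SYNTHESIS = "view synthesis"
    GENERATIVE_COMPOSITION = "generative composition"


# Assumed convention (not confirmed by the source):
# indicator = (adjust_illumination, adjust_pose), each flag is 0 or 1.
INDICATOR_TO_TASK: Dict[Tuple[int, int], Task] = {
    (0, 0): Task.BLENDING,                # keep illumination and pose
    (1, 0): Task.HARMONIZATION,           # change illumination only
    (0, 1): Task.VIEW_SYNTHESIS,          # change pose only
    (1, 1): Task.GENERATIVE_COMPOSITION,  # change both
}


def task_from_indicator(indicator: Tuple[int, int]) -> Task:
    """Return the task selected by a binary 2-dim indicator."""
    key = tuple(indicator)
    if key not in INDICATOR_TO_TASK:
        raise ValueError(f"indicator must be a pair of 0/1 flags, got {indicator!r}")
    return INDICATOR_TO_TASK[key]


if __name__ == "__main__":
    for flags in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        print(flags, "->", task_from_indicator(flags).value)
```

Under this reading, a single model covers all four tasks because the task is a conditioning input rather than a separate network; how the indicator is actually injected into the diffusion model is described in the paper, not in this sketch.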

Paper Structure (19 sections, 3 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of our controllable image composition method. We unify four tasks in one diffusion model and enable control over the illumination and pose of the synthesized foreground objects with a 2-dim indicator vector.
  • Figure 2: Illustration of our ControlCom. Our model consists of two main components: a foreground encoder (a) that extracts hierarchical embeddings from the foreground image, and a controllable generator (b) that synthesizes the composite image with control over foreground illumination and pose using the indicator $S$. See Figure 3 for the details of the local enhancement module.
  • Figure 3: Illustration of the local enhancement module.
  • Figure 4: Flowchart of synthetic data generation and augmentation.
  • Figure 5: Qualitative comparison on COCOEE dataset (top half) and our FOSCom dataset (bottom half). See Supp. for more visual results.
  • ...and 1 more figure