FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

Zhekai Chen, Wen Wang, Zhen Yang, Zeqing Yuan, Hao Chen, Chunhua Shen

TL;DR

FreeCompose leverages the diffusion prior of large pre-trained diffusion models to achieve generic zero-shot image composition without additional training. It identifies low-density regions in copy-paste composites via the diffusion denoising process and optimizes the image towards high-density areas, complemented by a mask-guided DDS loss for semantic control. The method covers object removal, image harmonization, and semantic image composition, showing competitive results and enabling applications like multi-concept customization. This work demonstrates the potential of diffusion priors to unify disparate composition tasks in a training-free framework, with implications for open-world editing and video extension.

Abstract

We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image. Rather than concentrating on specific use cases such as appearance editing (image harmonization) or semantic editing (semantic image composition), we showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition applicable to both scenarios. We observe that pre-trained diffusion models automatically identify simple copy-paste boundary areas as low-density regions during denoising. Building on this insight, we propose to optimize the composed image towards high-density regions guided by the diffusion prior. In addition, we introduce a novel mask-guided loss to further enable flexible semantic image composition. Extensive experiments validate the superiority of our approach in achieving generic zero-shot image composition. Additionally, our approach shows promising potential in various tasks, such as object removal and multi-concept customization.
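The core idea in the abstract (regions the prior considers unlikely receive large updates, and following those updates moves the image towards high-density regions) can be illustrated with a deliberately tiny toy. The sketch below is not the authors' implementation: it assumes a 1-D "image" whose pixels follow a single Gaussian prior, for which the optimal noise predictor has a closed form, and it uses a plain score-distillation-style gradient (predicted noise minus injected noise). All names (`predict_noise`, `prior_gradient`, `optimize`, `MU`, `S`) are hypothetical.

```python
import math
import random

# Toy stand-in for a frozen diffusion prior (assumption for illustration):
# every pixel of a "natural" image follows N(MU, S^2). For this prior the
# optimal noise predictor is known in closed form, so no network is needed.
MU, S = 0.0, 0.1


def predict_noise(x_t, alpha, sigma):
    """Closed-form noise prediction for x_t = alpha * x0 + sigma * eps."""
    var_t = alpha ** 2 * S ** 2 + sigma ** 2  # variance of x_t under the prior
    return [sigma * (v - alpha * MU) / var_t for v in x_t]


def prior_gradient(image, sigma, rng):
    """Score-distillation-style gradient: predicted noise minus injected noise."""
    alpha = math.sqrt(1.0 - sigma ** 2)
    eps = [rng.gauss(0.0, 1.0) for _ in image]
    x_t = [alpha * v + sigma * e for v, e in zip(image, eps)]
    eps_hat = predict_noise(x_t, alpha, sigma)
    return [p - e for p, e in zip(eps_hat, eps)]


def optimize(image, steps=100, lr=0.1, sigma=0.5, seed=0):
    """Repeatedly noise the image and step against the prior gradient."""
    rng = random.Random(seed)
    image = list(image)
    for _ in range(steps):
        grad = prior_gradient(image, sigma, rng)
        image = [v - lr * g for v, g in zip(image, grad)]
    return image


# 1-D "copy-paste" composite: background near the mode, pasted patch far away.
composite = [0.0, 0.05, -0.05, 1.0, 1.0, 0.0, 0.05]

# The two largest gradient magnitudes should land on the pasted pixels (3, 4).
grad = prior_gradient(composite, sigma=0.5, rng=random.Random(0))
largest = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:2]

# Following the gradient pulls every pixel towards the high-density region.
result = optimize(composite)
```

In this toy the prior has a single mode, so optimization collapses the whole signal towards `MU`. A real pre-trained diffusion model is multimodal and prompt-conditioned, which is what would let it correct boundary artifacts while preserving image content.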

Paper Structure (46 sections, 8 equations, 12 figures, 2 tables, 3 algorithms)

Figures (12)

  • Figure 2: Observations on the diffusion prior. The images on the left, denoted as copy-paste images, are obtained by simply pasting the foreground object to the background image. The frozen diffusion model takes the noisy copy-paste images from varying diffusion forward steps as input, and predicts the gradient to update the images (visualized on the right). Low-density regions with larger gradient updates are highlighted by red boxes. The low-density regions are highly consistent with the inharmonious regions caused by naive copy-paste.
  • Figure 3: Pipeline overview. Our FreeCompose pipeline consists of three phases: object removal, image harmonization, and semantic image composition. In each phase, the pipeline takes an input image and two text prompts to calculate the loss. In the object removal phase, an additional mask is required to select $K$, $V$ values. In the semantic image composition phase, text prompts can be replaced by other formats, and an additional $K$, $V$ replacement is implemented for identity consistency.
  • Figure 4: Qualitative comparison on object removal. We compare with LaMa [suvorov2022resolution], Stable Diffusion Inpainting [multiconcept], and RePaint [lugmayr2022repaint].
  • Figure 5: Qualitative comparison on image harmonization. We compare our method with the zero-shot Diff Harmonization [huang2024diffusion] and the training-based DucoNet [duconet].
  • Figure 6: Results on semantic image composition. Our method accepts various conditions as guidance, including text and sketches. The case in the top-left corner uses different prompts as guidance for editing, while the other cases are guided by different sketches with identical prompts.
  • ...and 7 more figures
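The Figure 3 caption mentions that a mask is used to select $K$, $V$ values during object removal. The caption does not spell out the mechanism, so the following is only one plausible reading, sketched under that assumption: keys and values belonging to the masked object are excluded from attention, so every query is reconstructed from the remaining (background) tokens. The function name `masked_attention` and the `keep` flag list are hypothetical, and the code is plain scaled dot-product attention rather than the paper's actual attention layers.

```python
import math


def masked_attention(queries, keys, values, keep):
    """Scaled dot-product attention restricted to keys where keep[j] is True.

    queries, keys, values: lists of equal-length float vectors (one per token).
    keep: list of booleans, False for tokens inside the object mask.
    Returns (outputs, weights).
    """
    if not any(keep):
        raise ValueError("at least one key must be kept")
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Excluded keys get a logit of -inf, so softmax gives them weight 0.
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if flag else -math.inf
            for k, flag in zip(keys, keep)
        ]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of value vectors, one output channel at a time.
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three tokens; token 1 is assumed to lie inside the object mask.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.2, 0.4], [9.0, 9.0], [0.6, 0.8]]
keep = [True, False, True]

outputs, weights = masked_attention(tokens, tokens, values, keep)

# Changing the masked token's value should leave every output untouched.
altered = [values[0], [-5.0, 5.0], values[2]]
outputs_altered, _ = masked_attention(tokens, tokens, altered, keep)
```

Because the masked token contributes zero weight, its content cannot leak into any output, which is the property a removal step would rely on under this reading.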