DiffHarmony: Latent Diffusion Model Meets Image Harmonization
Pengfei Zhou, Fangxiang Feng, Xiaojie Wang
TL;DR
DiffHarmony tackles image harmonization by adapting a pre-trained latent diffusion model (Stable Diffusion) to condition on the composite image $I_c$ and the foreground mask $M$. Two adaptations, Inpainting Variation and Null Text Input, make this conditioning effective. Because the autoencoder compresses images to $1/8$ resolution in latent space, the harmonized outputs can be blurry. DiffHarmony mitigates this distortion with two strategies: harmonizing at a higher input resolution during inference, and adding a refinement stage in which a UNet learns a residual correction. The approach achieves state-of-the-art or competitive results on the iHarmony4 dataset, with notable gains on challenging subsets such as HFlickr and Hday2night, and ablations validate the benefits of higher resolution and refinement. By avoiding training a diffusion model from scratch and addressing pixel-level fidelity, DiffHarmony offers a practical diffusion-based solution for image harmonization with potential for broader applications in image editing.
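To make the $1/8$ compression concrete, the following is a minimal sketch of the shape arithmetic behind conditioning on $I_c$ and $M$ in latent space. It assumes the 4-channel latent and the 9-channel input layout (noisy latent, mask, image latent) used by Stable Diffusion inpainting checkpoints; the exact layout used by DiffHarmony is not stated in this summary. The helper names `latent_shape`, `downsample_mask`, and `build_unet_input` are hypothetical, and random arrays stand in for real encoder outputs.

```python
import numpy as np

VAE_DOWNSCALE = 8    # each latent side is 1/8 of the pixel side (from the text)
LATENT_CHANNELS = 4  # assumption: the usual Stable Diffusion latent channel count


def latent_shape(height, width):
    """Return the (channels, height, width) of the latent for a given image size."""
    if height % VAE_DOWNSCALE or width % VAE_DOWNSCALE:
        raise ValueError("image sides must be multiples of 8")
    return (LATENT_CHANNELS, height // VAE_DOWNSCALE, width // VAE_DOWNSCALE)


def downsample_mask(mask, factor=VAE_DOWNSCALE):
    """Average-pool an (H, W) foreground mask M down to latent resolution."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def build_unet_input(noisy_latent, composite_latent, mask):
    """Stack noisy latent, downsampled mask, and composite latent on the channel axis.

    Assumed layout: 4 + 1 + 4 = 9 channels, as in inpainting-style UNets.
    """
    small_mask = downsample_mask(mask)[None, ...]  # add a channel axis: (1, h, w)
    return np.concatenate([noisy_latent, small_mask, composite_latent], axis=0)


if __name__ == "__main__":
    for side in (256, 512, 1024):
        print(side, "->", latent_shape(side, side))

    rng = np.random.default_rng(0)
    shape = latent_shape(256, 256)
    noisy = rng.standard_normal(shape)      # stand-in for the noisy latent
    composite = rng.standard_normal(shape)  # stand-in for the encoded I_c
    mask = np.zeros((256, 256))
    mask[64:192, 64:192] = 1.0              # square foreground region
    print(build_unet_input(noisy, composite, mask).shape)
```

Under these assumptions, a 256-pixel input yields only a 32-by-32 latent, while a 1024-pixel input yields 128 by 128, which illustrates why harmonizing at a higher input resolution leaves more spatial detail for the decoder to reconstruct.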
Abstract
Image harmonization, which involves adjusting the foreground of a composite image to attain a unified visual consistency with the background, can be conceptualized as an image-to-image translation task. Diffusion models have recently driven rapid progress in image-to-image translation. However, training diffusion models from scratch is computationally intensive. Fine-tuning pre-trained latent diffusion models, in turn, entails dealing with the reconstruction error induced by the image compression autoencoder, which makes such models poorly suited to image generation tasks that are evaluated with pixel-level metrics. To deal with these issues, in this paper, we first adapt a pre-trained latent diffusion model to the image harmonization task to generate harmonious but potentially blurry initial images. We then apply two strategies to further enhance the clarity of the initially harmonized images: using higher-resolution images during inference and incorporating an additional refinement stage. Extensive experiments on the iHarmony4 dataset demonstrate the superiority of our proposed method. The code and model will be made publicly available at https://github.com/nicecv/DiffHarmony.
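The two clarity strategies can be read as a simple control flow, sketched below under stated assumptions. The `harmonizer` and `refiner` arguments are hypothetical stand-ins for the adapted diffusion model and the refinement network, and their signatures, the integer `scale` factor, and the nearest-neighbor and mean resampling are illustrative choices rather than details taken from the paper.

```python
import numpy as np


def upsample_nearest(img, factor):
    """Nearest-neighbor upsampling of an (H, W) or (H, W, C) array."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)


def downsample_mean(img, factor):
    """Mean-pool an (H, W) or (H, W, C) array by an integer factor."""
    h, w = img.shape[:2]
    blocks = img.reshape(h // factor, factor, w // factor, factor, *img.shape[2:])
    return blocks.mean(axis=(1, 3))


def harmonize_with_refinement(composite, mask, harmonizer, refiner, scale=2):
    """Run a hypothetical two-stage pipeline on images with values in [0, 1].

    Stage 1: harmonize at a higher resolution, then return to the target size.
    Stage 2: add a predicted residual to the coarse result.
    """
    hi_composite = upsample_nearest(composite, scale)
    hi_mask = upsample_nearest(mask, scale)
    hi_output = harmonizer(hi_composite, hi_mask)   # stand-in for the diffusion stage
    coarse = downsample_mean(hi_output, scale)
    residual = refiner(coarse, composite, mask)     # stand-in for the refinement UNet
    return np.clip(coarse + residual, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    composite = rng.random((16, 16, 3))
    mask = np.zeros((16, 16))
    mask[4:12, 4:12] = 1.0
    # Trivial stand-ins: an identity harmonizer and a zero residual.
    result = harmonize_with_refinement(
        composite,
        mask,
        harmonizer=lambda image, m: image,
        refiner=lambda coarse, image, m: np.zeros_like(coarse),
    )
    print(result.shape)
```

With the trivial stand-ins used here, the pipeline returns the composite unchanged, which is a convenient sanity check before real models are plugged in.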
