DiffHarmony: Latent Diffusion Model Meets Image Harmonization

Pengfei Zhou; Fangxiang Feng; Xiaojie Wang

TL;DR

DiffHarmony tackles image harmonization by adapting a pre-trained latent diffusion model (Stable Diffusion) so that it is conditioned on the composite image $I_c$ and the foreground mask $M$. Two adaptations, Inpainting Variation and Null Text Input, make this conditioning effective. Because the latent space compresses images to $1/8$ of their resolution, the initial harmonized outputs can be blurry. The method mitigates this distortion with two strategies: harmonizing at a higher input resolution during inference, and adding a refinement stage in which a UNet learns a residual correction. The approach achieves state-of-the-art or competitive results on the iHarmony4 dataset, with notable gains on challenging subsets such as HFlickr and Hday2night, and ablations validate the benefits of higher resolution and refinement. By reducing reliance on end-to-end diffusion training and addressing pixel-level fidelity, DiffHarmony offers a practical diffusion-based solution for image harmonization, with potential for broader applications in image editing.

Abstract

Image harmonization, which involves adjusting the foreground of a composite image so that it is visually consistent with the background, can be framed as an image-to-image translation task. Diffusion models have recently driven rapid progress in image-to-image translation. However, training diffusion models from scratch is computationally intensive, and fine-tuning pre-trained latent diffusion models means dealing with the reconstruction error introduced by the image compression autoencoder, which makes such models ill-suited to image generation tasks evaluated with pixel-level metrics. To deal with these issues, in this paper, we first adapt a pre-trained latent diffusion model to the image harmonization task to generate harmonious but potentially blurry initial images. We then implement two strategies to further enhance the clarity of the initially harmonized images: using higher-resolution images during inference and incorporating an additional refinement stage. Extensive experiments on the iHarmony4 dataset demonstrate the superiority of our proposed method. The code and model will be made publicly available at https://github.com/nicecv/DiffHarmony.
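To make the resolution argument concrete, the short sketch below computes the latent grid size that a $1/8$ spatial compression produces for a few input sizes. It is only an illustration: the helper name `latent_shape`, the example resolutions, and the assumption of 4 latent channels (the usual Stable Diffusion VAE configuration) are ours and are not stated in the text above.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size.

    Assumptions (not from the paper text): the VAE downsamples each spatial
    side by `factor` and emits `latent_channels` channels.
    """
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the compression factor")
    return (latent_channels, height // factor, width // factor)


# Illustrative input sizes: a larger input leaves the latent with more
# spatial positions to represent the same scene content.
for side in (256, 512, 1024):
    channels, lat_h, lat_w = latent_shape(side, side)
    print(f"{side}x{side} image -> {channels}x{lat_h}x{lat_w} latent")
```

Under these assumptions, doubling the input side quadruples the number of latent positions, which is one way to read the motivation for harmonizing at a higher resolution during inference.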

Paper Structure (22 sections, 2 figures, 4 tables)


Introduction
Method
DiffHarmony: Adapting Stable Diffusion
Inpainting Variation
Null Text Input
Alleviate Image Distortion
Harmonization At Higher Resolution
Add Refinement Stage
Experiment
Experiment Settings
Dataset
Implementation Detail
Evaluation
Performance Comparison
Qualitative Results
...and 7 more sections

Figures (2)

Figure 1: Architecture of our method. In the harmonization stage, which uses DiffHarmony, the composite image $I_c$ is encoded by the VAE and the foreground mask $M$ is downsampled; the two are then concatenated to form the image condition. The diffusion model performs inference, and its output is mapped back to image space by the VAE decoder, yielding $\tilde{I}_h$. In the refinement stage, $\tilde{I}_h$, $I_c$, and $M$ are scaled down and concatenated as input to the refinement model. Adding the refinement model's output to the downscaled $\tilde{I}_h$ gives the final refined image $I_h$.
Figure 2: Qualitative comparison on samples from the test set of iHarmony4.
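The refinement stage described in the Figure 1 caption can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: tensors are plain nested lists indexed as `[channel][row][col]`, the inputs are assumed to be already scaled down to a common size, the channel counts (3 + 3 + 1) assume RGB images and a single-channel mask, and `toy_refiner` is a hypothetical stand-in for the refinement UNet.

```python
def constant(channels, height, width, value):
    """Build a [channel][row][col] tensor filled with one value (toy data)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(*tensors):
    """Concatenate tensors along the channel axis after checking spatial sizes."""
    height, width = len(tensors[0][0]), len(tensors[0][0][0])
    out = []
    for tensor in tensors:
        for channel in tensor:
            if len(channel) != height or len(channel[0]) != width:
                raise ValueError("spatial sizes must match before concatenation")
            out.append(channel)
    return out


def add_residual(image, residual):
    """Element-wise sum of two tensors with identical shapes."""
    return [
        [[p + r for p, r in zip(img_row, res_row)]
         for img_row, res_row in zip(img_ch, res_ch)]
        for img_ch, res_ch in zip(image, residual)
    ]


def refine(harmonized, composite, mask, refiner):
    """I_h = downscaled harmonized image + refiner(concat(harmonized, I_c, M))."""
    refiner_input = concat_channels(harmonized, composite, mask)
    return add_residual(harmonized, refiner(refiner_input))


def toy_refiner(refiner_input):
    """Hypothetical stand-in for the UNet: a constant residual of 0.25."""
    height, width = len(refiner_input[0]), len(refiner_input[0][0])
    return constant(3, height, width, 0.25)


harmonized = constant(3, 2, 2, 0.5)   # stands in for downscaled \tilde{I}_h
composite = constant(3, 2, 2, 0.4)    # stands in for downscaled I_c
mask = constant(1, 2, 2, 1.0)         # stands in for downscaled M
refined = refine(harmonized, composite, mask, toy_refiner)
```

Predicting a residual instead of the full image means the refinement model only has to model the correction to $\tilde{I}_h$, so a refiner that outputs zeros leaves the harmonized image unchanged.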
