Zero-Shot Image Harmonization with Generative Model Prior
Jianqi Chen, Yilan Zhang, Zhengxia Zou, Keyan Chen, Zhenwei Shi
TL;DR
This work tackles the challenge of image harmonization without large training datasets by proposing a zero-shot, human-behavior–inspired pipeline. A vision-language model generates imaging-condition descriptions, which guide a text-to-image diffusion process to harmonize the foreground, with text-embedding refinement and structure-preserving constraints to maintain content integrity; an evaluator steers iterations toward natural-looking results. Key contributions include the embedding refinement objective $\text{L}_{Emb} = \| M - \frac{Att(Emb)}{\max(Att(Emb))} \|_2^2$ and the structure-preservation loss $\text{L}_{Edge} = \| E_S(I) - E_S(I') \|_2^2 + \gamma \| E_D(I) - E_D(I') \|_2^2$, plus a two-class classifier for automatic evaluation and a real-world 300-image dataset with a comprehensive user study. The approach demonstrates competitive qualitative results and superior user preference compared to supervised baselines, highlighting the potential of leveraging the priors in pretrained generative models for robust, data-efficient harmonization with broad applicability.
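To make the two objectives concrete, the following is a minimal NumPy sketch of how they could be evaluated on arrays. It is an illustration under stated assumptions, not the authors' implementation: this summary does not define the symbols, so the sketch assumes $M$ is a foreground mask, $Att(Emb)$ is an attention map of the same shape, and $E_S$, $E_D$ are two edge extractors whose outputs are passed in precomputed. The function names and the default value of `gamma` are hypothetical.

```python
import numpy as np


def embedding_loss(mask, attention):
    """Squared L2 distance between a mask M and a max-normalized attention map.

    Mirrors L_Emb = || M - Att(Emb) / max(Att(Emb)) ||_2^2.
    Assumption: `mask` and `attention` are arrays of the same shape.
    """
    mask = np.asarray(mask, dtype=np.float64)
    attention = np.asarray(attention, dtype=np.float64)
    peak = attention.max()
    if peak <= 0:
        raise ValueError("attention map must contain a positive value")
    normalized = attention / peak  # scale so the largest response equals 1
    return float(np.sum((mask - normalized) ** 2))


def edge_loss(es_src, es_out, ed_src, ed_out, gamma=1.0):
    """Squared L2 edge-map mismatch between the input I and the output I'.

    Mirrors L_Edge = ||E_S(I) - E_S(I')||_2^2 + gamma * ||E_D(I) - E_D(I')||_2^2.
    Assumption: the four edge maps are precomputed by the caller; gamma=1.0
    is a placeholder, not a value taken from the paper.
    """
    es_src, es_out = np.asarray(es_src, float), np.asarray(es_out, float)
    ed_src, ed_out = np.asarray(ed_src, float), np.asarray(ed_out, float)
    term_s = np.sum((es_src - es_out) ** 2)
    term_d = np.sum((ed_src - ed_out) ** 2)
    return float(term_s + gamma * term_d)
```

Under these assumptions, `embedding_loss` reaches zero when the normalized attention coincides with the mask, and `edge_loss` reaches zero when both edge maps are unchanged by harmonization.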
Abstract
We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images in existing methods. These methods, while showing promising results, involve significant training expenses and often struggle with generalization to unseen images. To this end, we introduce a fully modularized framework inspired by human behavior. Leveraging the reasoning capabilities of recent foundation models in language and vision, our approach comprises three main stages. Initially, we employ a pretrained vision-language model (VLM) to generate descriptions for the composite image. Subsequently, these descriptions guide the foreground harmonization direction of a text-to-image generative model (T2I). We refine text embeddings for enhanced representation of imaging conditions and employ self-attention and edge maps for structure preservation. Following each harmonization iteration, an evaluator determines whether to conclude or modify the harmonization direction. The resulting framework, mirroring human behavior, achieves harmonious results without the need for extensive training. We present compelling visual results across diverse scenes and objects, along with a user study validating the effectiveness of our approach.
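The three stages described above can be sketched as a simple control loop. This is a hedged outline only: the four callables (`describe`, `harmonize_step`, `is_harmonious`, `revise`) are hypothetical stand-ins for the VLM, the T2I model, and the evaluator, and the choices to feed each iteration the previous result and to cap the loop at `max_iters` are assumptions of this sketch rather than details stated in the abstract.

```python
from typing import Any, Callable, Tuple


def harmonize(
    composite: Any,
    mask: Any,
    describe: Callable[[Any, Any], Any],
    harmonize_step: Callable[[Any, Any, Any], Any],
    is_harmonious: Callable[[Any, Any], bool],
    revise: Callable[[Any, Any, Any], Any],
    max_iters: int = 5,
) -> Tuple[Any, int]:
    """Run the describe -> harmonize -> evaluate loop.

    Returns the final image and the number of iterations performed.
    All callables are placeholders supplied by the caller.
    """
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")

    # Stage 1: a vision-language model describes the composite image.
    description = describe(composite, mask)

    result = composite
    for iteration in range(1, max_iters + 1):
        # Stage 2: the description guides foreground harmonization.
        result = harmonize_step(result, mask, description)

        # Stage 3: the evaluator decides whether to conclude...
        if is_harmonious(result, mask):
            return result, iteration

        # ...or to modify the harmonization direction and try again.
        description = revise(description, result, mask)

    return result, max_iters
```

Passing the models in as callables keeps the sketch independent of any particular library, which matches the modular framing of the framework.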
