Zero-Shot Image Harmonization with Generative Model Prior

Jianqi Chen, Yilan Zhang, Zhengxia Zou, Keyan Chen, Zhenwei Shi

TL;DR

This work tackles the challenge of image harmonization without large training datasets by proposing a zero-shot, human-behavior-inspired pipeline. A vision-language model generates imaging-condition descriptions, which guide a text-to-image diffusion process to harmonize the foreground, with text-embedding refinement and structure-preserving constraints to maintain content integrity; an evaluator steers iterations toward natural-looking results. Key contributions include the embedding refinement objective $\text{L}_{Emb} = \left\| M - \frac{Att(Emb)}{\max(Att(Emb))} \right\|_2^2$ and the structure-preservation loss $\text{L}_{Edge} = \| E_S(I) - E_S(I') \|_2^2 + \gamma \| E_D(I) - E_D(I') \|_2^2$, plus a two-class classifier for automatic evaluation and a real-world 300-image dataset with a comprehensive user study. The approach demonstrates competitive qualitative results and superior user preference compared to supervised baselines, highlighting the potential of leveraging the priors in pretrained generative models for robust, data-efficient harmonization with broad applicability.
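
To make the two objectives above concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that $M$ is a region mask and $Att(Emb)$ an attention map of the same size, and that the edge maps $E_S(\cdot)$ and $E_D(\cdot)$ have already been computed by two edge extractors. The helper names (`squared_l2`, `embedding_loss`, `edge_loss`) and the nested-list representation are hypothetical and chosen only for illustration.

```python
from typing import List

Grid = List[List[float]]  # a 2D map stored as nested lists (illustrative choice)


def squared_l2(a: Grid, b: Grid) -> float:
    """Sum of squared element-wise differences between two equally sized maps."""
    return sum(
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def embedding_loss(mask: Grid, attention: Grid) -> float:
    """L_Emb = || M - Att(Emb) / max(Att(Emb)) ||_2^2.

    Assumption: `mask` plays the role of M and `attention` the role of Att(Emb).
    """
    peak = max(max(row) for row in attention)
    if peak <= 0:
        raise ValueError("attention map must have a positive maximum")
    normalized = [[value / peak for value in row] for row in attention]
    return squared_l2(mask, normalized)


def edge_loss(es_src: Grid, es_out: Grid, ed_src: Grid, ed_out: Grid,
              gamma: float) -> float:
    """L_Edge = || E_S(I) - E_S(I') ||_2^2 + gamma * || E_D(I) - E_D(I') ||_2^2.

    Assumption: the four arguments are precomputed edge maps of the input I
    and the harmonized output I'.
    """
    return squared_l2(es_src, es_out) + gamma * squared_l2(ed_src, ed_out)
```

In this sketch, an attention map that is a scaled copy of the mask gives `embedding_loss` equal to zero, which matches the intent of the max-normalization: only the shape of the attention response matters, not its absolute scale.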

Abstract

We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images in existing methods. These methods, while showing promising results, involve significant training expenses and often struggle with generalization to unseen images. To this end, we introduce a fully modularized framework inspired by human behavior. Leveraging the reasoning capabilities of recent foundation models in language and vision, our approach comprises three main stages. Initially, we employ a pretrained vision-language model (VLM) to generate descriptions for the composite image. Subsequently, these descriptions guide the foreground harmonization direction of a text-to-image generative model (T2I). We refine text embeddings for enhanced representation of imaging conditions and employ self-attention and edge maps for structure preservation. Following each harmonization iteration, an evaluator determines whether to conclude or modify the harmonization direction. The resulting framework, mirroring human behavior, achieves harmonious results without the need for extensive training. We present compelling visual results across diverse scenes and objects, along with a user study validating the effectiveness of our approach.

Paper Structure

This paper contains 17 sections, 6 equations, 17 figures, and 1 table.

Figures (17)

  • Figure 1: Given a composite image, our method produces a harmonized result in which the color space of the foreground is aligned with that of the background. Our method does not require collecting a large number of composite images for training; it only uses pretrained generative models. The leftmost image in the upper row is the source image of the foreground ("house"), and the others are harmonized results of the foreground object in different backgrounds. In the lower row, we take one of the composite images as an example to show the harmonization process. For a concise overview of our approach, please refer to our presentation video: https://www.youtube.com/watch?v=mfBTIVp6JBU&t=4s.
  • Figure 2: Human behavior in image harmonization. Humans can perform harmonization relying only on their long-term prior, without seeing many composite images in advance, for example when harmonizing the overbright dog shown in the figure.
  • Figure 3: The proposed framework and workflow across modules. The framework comprises three main components: (a) Imaging Condition Description Generation: A vision-language model (VLM) generates descriptions of the input composite image, stating what the object is and describing the imaging conditions of the foreground and background regions. (b) Foreground Region Harmonization: A text-to-image (T2I) diffusion generative model takes both the previously generated descriptions and the edge map of the composite image as input. An image editing technique, together with specific harmonization operations, is used to achieve image harmonization. (c) Performance Evaluation: A two-class classifier serves as an evaluator that decides whether the current result is natural-looking enough, whether the description should be regenerated, or whether the harmonization iteration should continue.
  • Figure 4: Designs in the Foreground Region Harmonization. Starting from the composite image, we invert the last harmonized result into its diffusion latent and then apply harmonization operations to obtain the next harmonized result. Among these harmonization operations, (a) Text Embedding Refinement is designed to obtain text embeddings that better represent the foreground/background environment. (b) In Content Structure Preservation, we leverage self-attention maps to retain high-level structure and use edge maps to preserve low-level details. The editing process is based on the Prompt-to-Prompt (P2P) editing technique (Hertz et al., 2022). For brevity, the figure does not explicitly depict this part; further details can be found in their work.
  • Figure 5: Demonstration of the evaluation process. We visualize how the decision of Continue/Regenerate/Conclude is made. Please zoom in for a better view.
  • ...and 12 more figures
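
The Continue/Regenerate/Conclude decision described in the captions of Figures 3 and 5 can be read as a simple control loop. The sketch below is only an illustration under stated assumptions, not the authors' code: `describe`, `harmonize_step`, and `evaluate` are hypothetical callables standing in for the VLM, the T2I harmonization step, and the two-class evaluator, and `max_iters` is an assumed safety cap that the source does not mention. Regenerating the description from the latest result is also an assumption.

```python
from typing import Any, Callable, List, Tuple


def harmonize(
    composite: Any,
    describe: Callable[[Any], str],             # stand-in for the VLM
    harmonize_step: Callable[[Any, str], Any],  # stand-in for one T2I harmonization pass
    evaluate: Callable[[Any], str],             # returns "continue", "regenerate", or "conclude"
    max_iters: int = 10,                        # assumed cap, not taken from the source
) -> Tuple[Any, List[str]]:
    """Iterate harmonization until the evaluator concludes or the cap is hit."""
    description = describe(composite)
    current = composite
    decisions: List[str] = []
    for _ in range(max_iters):
        # Each pass starts from the last harmonized result (see Figure 4).
        current = harmonize_step(current, description)
        decision = evaluate(current)
        decisions.append(decision)
        if decision == "conclude":
            break
        if decision == "regenerate":
            # Assumption: the new description is generated from the latest result.
            description = describe(current)
        # On "continue", keep the current description and iterate again.
    return current, decisions
```

Passing the three modules as callables mirrors the modular design stated in the abstract: each stage can be swapped without touching the loop.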