Training-and-Prompt-Free General Painterly Harmonization via Zero-Shot Disentenglement on Style and Content References

Teng-Fang Hsiao, Bo-Kai Ruan, Hong-Han Shuai

TL;DR

Painterly harmonization is challenged by content disruption and training dependencies. The paper presents TF-GPH, a training-and-prompt-free diffusion-based framework that uses a Similarity Disentangle Mask and Similarity Reweighting to selectively fuse content and style from three inputs via a Share-Attention Module. It introduces the General Painterly Harmonization Benchmark (GPH-Benchmark) with range-based metrics to capture stylization-content trade-offs. Across qualitative and quantitative evaluations, TF-GPH achieves harmonious results without fine-tuning, offering flexible, real-world painterly editing capabilities.

Abstract

Painterly image harmonization aims at seamlessly blending disparate visual elements within a single image. However, previous approaches often struggle due to limitations in training data or reliance on additional prompts, leading to inharmonious and content-disrupted output. To surmount these hurdles, we design a Training-and-prompt-Free General Painterly Harmonization method (TF-GPH). TF-GPH incorporates a novel "Similarity Disentangle Mask", which disentangles the foreground content and background image by redirecting their attention to the corresponding reference images, enhancing the attention mechanism for multi-image inputs. Additionally, we propose a "Similarity Reweighting" mechanism to balance harmonization between stylization and content preservation. This mechanism minimizes content disruption by prioritizing the content-similar features within the given background style reference. Finally, we address the deficiencies in existing benchmarks by proposing novel range-based evaluation metrics and a new benchmark to better reflect real-world applications. Extensive experiments demonstrate the efficacy of our method on all benchmarks. More details are available at https://github.com/BlueDyee/TF-GPH.
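
To make the two mechanisms named in the abstract more concrete, the following is a minimal NumPy sketch of masked shared attention over three inputs: a foreground reference (f), a background style reference (b), and the composite (c). It is an illustration under stated assumptions, not the authors' implementation. The function name `share_attention`, the dictionary layout, the rule that the composite attends only to the two references, and the use of additive log-biases `log(alpha)` and `log(beta)` for reweighting are all assumptions made here; the exact formulation is not given in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def share_attention(q, k, v, alpha=1.0, beta=1.0):
    """Hypothetical masked shared attention over three images.

    q, k, v: dicts with keys 'f', 'b', 'c', each an array of shape (n, d).
    alpha, beta: positive weights on the foreground / background references,
    applied only to the composite's queries (an assumption of this sketch).
    Returns (outputs, attention_maps), both dicts keyed by 'f', 'b', 'c'.
    """
    names = ("f", "b", "c")
    n, d = q["f"].shape
    # Keys and values of all three images are concatenated ("shared").
    K = np.concatenate([k[s] for s in names], axis=0)  # (3n, d)
    V = np.concatenate([v[s] for s in names], axis=0)  # (3n, d)
    # Assumed disentangle mask: each reference attends only to itself,
    # while the composite is redirected to the two references.
    allowed = {"f": ("f",), "b": ("b",), "c": ("f", "b")}
    # Assumed reweighting: adding log(w) to a similarity block multiplies
    # that block's unnormalized attention mass by w.
    bias = {"f": np.log(alpha), "b": np.log(beta)}
    out, attn = {}, {}
    for s in names:
        sim = q[s] @ K.T / np.sqrt(d)  # (n, 3n) scaled dot-product similarity
        for j, r in enumerate(names):
            block = slice(j * n, (j + 1) * n)
            if r not in allowed[s]:
                sim[:, block] = -np.inf       # masked out
            elif s == "c":
                sim[:, block] += bias[r]      # reweight the references
        attn[s] = softmax(sim, axis=-1)
        out[s] = attn[s] @ V
    return out, attn
```

In this sketch, raising `beta` relative to `alpha` moves attention mass of the composite toward the background style reference (stronger stylization), while raising `alpha` favors the foreground content; both can be varied continuously.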

Paper Structure (39 sections, 3 equations, 18 figures, 8 tables, 1 algorithm)

This paper contains 39 sections, 3 equations, 18 figures, 8 tables, 1 algorithm.

Figures (18)

  • Figure 1: Our method overcomes the resolution and staged-progressive painterly harmonization limitations of the SOTA method ProPIH, where users are restricted to selecting the stylization strength from one of four stages. In contrast, our approach offers continuously adjustable hyperparameters, allowing for more flexible stylization. Additionally, our method effectively mitigates content disruption issues, such as facial alterations, commonly seen in image-editing methods like ZSTAR.
  • Figure 2: An example demonstrating the three tasks in general painterly harmonization: Object Insertion (columns 1 to 3), Object Swapping (columns 4 and 5), and Style Transfer (columns 6 and 7). The top row features user-generated composite images, where green boxes highlight the style reference of the final two. The bottom row showcases the results of our method.
  • Figure 3: The architecture of our proposed TF-GPH method involves several stages. Initially, we feed the denoising U-Net with the inverse latent $Z_t$, and during the first $l < L_{\text{share}}-1$ layers of the U-Net, the three latent representations, $z^\text{f}_t$, $z^\text{b}_t$, and $z^\text{c}_t$, are forwarded separately to the Attention Module. Afterward, they are fed into the Share-Attention Module (the blue part below), obtaining their image-wise attention via Eq. \ref{eq:share_attention}. In the end, the output harmonized image $I^\text{o}$ is produced.
  • Figure 4: Comparison of different attention strategies with their corresponding similarity masks (read with Fig. \ref{fig:overall}).
  • Figure 5: Qualitative results of object insertion (rows 1 and 2), object swapping (rows 3 and 4), and style transfer (rows 5 and 6).
  • ...and 13 more figures