Less is More: Masking Elements in Image Condition Features Avoids Content Leakages in Style Transfer Diffusion Models

Lin Zhu, Xinbing Wang, Chenghu Zhou, Qinying Gu, Nanyang Ye

TL;DR

This work addresses content leakage in style transfer diffusion models that use a style-reference image as an additional condition. It introduces a training-free masking approach that zeroes content-related elements in the style-reference image features, guided by clustering the element-wise product with content text features, to decouple content from style. The authors prove theoretically that guiding with fewer appropriately chosen conditions can yield a smaller divergence between generated and real image distributions, supporting a 'Less is More' principle. Empirically, the method delivers stronger style transfer with reduced content leakage across diverse styles and datasets, without requiring parameter tuning, highlighting practical robustness for diffusion-based style transfer tasks.

Abstract

Given a style-reference image as the additional image condition, text-to-image diffusion models have demonstrated impressive capabilities in generating images that possess the content of text prompts while adopting the visual style of the reference image. However, current state-of-the-art methods often struggle to disentangle content and style from style-reference images, leading to issues such as content leakages. To address this issue, we propose a masking-based method that efficiently decouples content from style without tuning any model parameters. By simply masking specific elements in the style reference's image features, we uncover a critical yet under-explored principle: guiding with fewer, appropriately selected conditions (e.g., dropping several image feature elements) can efficiently keep unwanted content from flowing into the diffusion models, enhancing the style transfer performance of text-to-image diffusion models. In this paper, we validate this finding both theoretically and experimentally. Extensive experiments across various styles demonstrate the effectiveness of our masking-based method and support our theoretical results.
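
The masking step described above (cluster the element-wise product of the image and content text features, then zero the elements in the high-mean cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `two_means_1d` and `mask_content_elements`, the toy feature vectors, and the choice of a hand-rolled two-cluster 1-D k-means initialised from the minimum and maximum are all assumptions made for illustration. It also assumes both features already live in a shared embedding space with the same dimensionality.

```python
def two_means_1d(values, max_iters=100):
    """Hypothetical helper: 1-D k-means with k=2 on a non-empty list.

    Returns one label per value, where 1 marks the higher-mean cluster.
    Centroids are initialised at the minimum and maximum (an assumption).
    """
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(max_iters):
        # Assign each value to the nearer centroid (ties go to the low cluster).
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = sum(low) / len(low) if low else lo
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return labels


def mask_content_elements(image_feat, text_feat):
    """Zero the image-feature elements whose product with the content
    text feature falls in the high-mean cluster (sketch only)."""
    products = [a * b for a, b in zip(image_feat, text_feat)]
    labels = two_means_1d(products)
    masked = [0.0 if lab == 1 else a for a, lab in zip(image_feat, labels)]
    return masked, labels


# Toy vectors (made up): elements 0 and 3 align strongly with the content text.
image_feat = [0.9, 0.1, -0.2, 0.8, 0.05, 0.3]
text_feat = [0.8, 0.1, 0.3, 0.9, -0.1, 0.2]
masked, labels = mask_content_elements(image_feat, text_feat)
print(labels)  # 1 marks elements treated as content-related
print(masked)  # the same feature with those elements set to zero
```

In this toy case only the two elements with large products are zeroed; the remaining elements pass through unchanged and would serve as the image condition. No model parameters are involved, which matches the training-free framing of the method.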

Paper Structure

This paper contains 25 sections, 3 theorems, 20 equations, 22 figures, 6 tables, 2 algorithms.

Key Result

Proposition 1

[The superiority of the proposed masked element selection method] We denote the masked elements in the image feature as $\boldsymbol{e}_1^{s+1}, \cdots, \boldsymbol{e}_1^{d}$ and denote the feature composed by these elements as $\boldsymbol{e}_1^m$, i.e., $\boldsymbol{e}_1^m := [\boldsymbol{e}_1^{s+1}, \cdots, \boldsymbol{e}_1^{d}]$ ...

Figures (22)

  • Figure 1: Given a style-reference image, our method is capable of synthesizing new images that resemble the style and are faithful to text prompts simultaneously. Previous methods often face issues of either content leakages or style degradation. We mark the results with significant content leakages, style degradation, and loss of text fidelity with red, green, and blue boxes, respectively.
  • Figure 2: Top: The differences in the conditions between IP-Adapter (Ye et al., 2023), InstantStyle (Wang et al., 2024), and Ours. We elaborate on how to select masked elements in Section \ref{sec:Zero-Shot-LecDiff}. Bottom: Illustration of the content leakages issue.
  • Figure 3: (a) The proposed content-related elements identification method: we cluster the element-wise product between image and text features and directly discard elements in the high-means cluster; (b) Illustration of tuning-based models, which we detail in Section \ref{sec:superiority-of-less-condition}. Text-Adapter and Image-Adapter learn the content feature from the text content feature and image feature, respectively. Only the newly added feature adapter modules (denoted as "Linear+LN") are trained while the pre-trained diffusion model is frozen.
  • Figure 4: In the figure, the text prompt is "A human". Leveraging appropriately fewer conditions, Ours(ZS) and Ours(FT) denote the proposed masking-based method and the tuning-based Image-Adapter method, respectively. Our methods successfully transfer the references' styles without content leakages. More results can be found in Figure \ref{fig:more-results-exp1-1} and Figures \ref{fig:more-results-exp1-2}-\ref{fig:more-results-exp1-6} in Appendix \ref{app-sec: more-results-exp1}.
  • Figure 5: Comparison between the Image-Adapter and Text-Adapter models. (a) Following StyleShot (Gao et al., 2024), we report the image and text alignment scores alongside training steps. We also present the fidelity scores of the tuning-free models (i.e., IP-Adapter, InstantStyle, and our masking-based method) in the figure. (b) Visual comparisons between Image-Adapter and Text-Adapter.
  • ...and 17 more figures

Theorems & Definitions (3)

  • Proposition 1
  • Theorem 1
  • Theorem 2