Less is More: Masking Elements in Image Condition Features Avoids Content Leakages in Style Transfer Diffusion Models
Lin Zhu, Xinbing Wang, Chenghu Zhou, Qinying Gu, Nanyang Ye
TL;DR
This work addresses content leakage in style transfer diffusion models that use a style-reference image as an additional condition. It introduces a training-free masking approach that zeroes content-related elements in the style-reference image features, selected by clustering the element-wise product of those features with the content text features, to decouple content from style. The authors prove theoretically that guiding with fewer, appropriately chosen conditions can yield a smaller divergence between the generated and real image distributions, supporting a "Less is More" principle. Empirically, the method delivers stronger style transfer with reduced content leakage across diverse styles and datasets, without tuning any model parameters.
Abstract
Given a style-reference image as an additional image condition, text-to-image diffusion models have demonstrated impressive capabilities in generating images that possess the content of text prompts while adopting the visual style of the reference image. However, current state-of-the-art methods often struggle to disentangle content and style in style-reference images, leading to issues such as content leakage. To address this issue, we propose a masking-based method that efficiently decouples content from style without tuning any model parameters. By simply masking specific elements in the style reference's image features, we uncover a critical yet under-explored principle: guiding with fewer, appropriately selected conditions (e.g., dropping several image feature elements) can efficiently prevent unwanted content from flowing into the diffusion model, enhancing the style transfer performance of text-to-image diffusion models. In this paper, we validate this finding both theoretically and experimentally. Extensive experiments across various styles demonstrate the effectiveness of our masking-based method and support our theoretical results.
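As a rough illustration of the masking idea summarized in the TL;DR (zeroing image feature elements selected by clustering the element-wise product with the content text features), the following is a minimal sketch and not the authors' implementation. The function names `two_means_1d` and `mask_content_elements` are hypothetical, and the use of two clusters, a simple 1-D k-means, and masking the higher-mean cluster are assumptions made for illustration only.

```python
from typing import List, Sequence


def two_means_1d(values: Sequence[float], iters: int = 50) -> List[int]:
    """Split scalars into two clusters; label 1 marks the higher-mean cluster.

    Assumption: a plain 1-D 2-means initialised at the min and max values.
    """
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to the nearer of the two centres.
        new_labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        ones = [v for v, lab in zip(values, new_labels) if lab == 1]
        zeros = [v for v, lab in zip(values, new_labels) if lab == 0]
        # Recompute centres, keeping the old centre if a cluster is empty.
        if ones:
            hi = sum(ones) / len(ones)
        if zeros:
            lo = sum(zeros) / len(zeros)
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def mask_content_elements(
    image_feat: Sequence[float], content_text_feat: Sequence[float]
) -> List[float]:
    """Zero the image feature elements judged to be content-related.

    Assumption: elements whose product with the content text feature falls in
    the higher-mean cluster are treated as content-related and set to zero.
    """
    if len(image_feat) != len(content_text_feat):
        raise ValueError("feature vectors must have the same length")
    product = [a * b for a, b in zip(image_feat, content_text_feat)]
    labels = two_means_1d(product)
    return [0.0 if lab == 1 else x for x, lab in zip(image_feat, labels)]


if __name__ == "__main__":
    # Toy 5-dimensional features; real image and text features are far larger.
    style_image_feat = [0.9, 0.1, 0.8, -0.2, 0.05]
    content_text_feat = [1.0, 0.1, 0.9, 0.3, -0.1]
    print(mask_content_elements(style_image_feat, content_text_feat))
```

In this toy case the first and third elements have the largest products with the content text feature, so they are zeroed while the remaining elements pass through unchanged. No model parameters are touched, which matches the training-free character described above.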
