CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization

Xiaoman Feng, Mingkun Lei, Yang Wang, Dingwen Fu, Chi Zhang

TL;DR

This work introduces CleanStyle, a plug-and-play framework that filters content-related noise out of the style embedding and integrates into existing encoder-based diffusion models without retraining.

Abstract

Style transfer in diffusion models enables controllable visual generation by injecting the style of a reference image. However, recent encoder-based methods, while efficient and tuning-free, often suffer from content leakage, where semantic elements from the style image undesirably appear in the output, impairing prompt fidelity and stylistic consistency. In this work, we introduce CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retraining. Motivated by empirical analysis, we observe that such leakage predominantly stems from the tail components of the style embedding, which are isolated via Singular Value Decomposition (SVD). To address this, we propose CleanStyleSVD (CS-SVD), which dynamically suppresses tail components using a time-aware exponential schedule, providing clean, style-preserving conditional embeddings throughout the denoising process. Furthermore, we present Style-Specific Classifier-Free Guidance (SS-CFG), which reuses the suppressed tail components to construct style-aware unconditional inputs. Unlike conventional methods that use generic negative embeddings (e.g., zero vectors), SS-CFG introduces targeted negative signals that reflect style-specific but prompt-irrelevant visual elements. This enables the model to effectively suppress these distracting patterns during generation, thereby improving prompt fidelity and enhancing the overall visual quality of stylized outputs. Our approach is lightweight, interpretable, and can be seamlessly integrated into existing encoder-based diffusion models without retraining. Extensive experiments demonstrate that CleanStyle substantially reduces content leakage and improves both stylization quality and prompt alignment across a wide range of style references and prompts.
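
The tail suppression described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `suppress_tail`, the cutoff `keep_rank`, the rate `decay`, and the exact schedule `exp(-decay * (1 - progress))` are all assumptions made for illustration. The only properties taken from the text are that the tail singular components are attenuated and that the attenuation is stronger at early denoising steps than at late ones.

```python
import numpy as np

def suppress_tail(embedding, t, num_steps, keep_rank=4, decay=3.0):
    """Split a style embedding (tokens x dim) via SVD and damp its tail.

    keep_rank, decay, and the schedule form are hypothetical choices,
    not values taken from the paper.
    Returns (clean_embedding, tail_component).
    """
    U, S, Vt = np.linalg.svd(embedding, full_matrices=False)

    # Progress through sampling: 0.0 at the first step, 1.0 at the last.
    progress = t / max(num_steps - 1, 1)

    # Assumed exponential schedule: the tail weight is small early
    # (strong suppression) and reaches 1.0 at the final step.
    tail_weight = np.exp(-decay * (1.0 - progress))

    # Conditional embedding: main components kept, tail scaled down.
    S_clean = S.copy()
    S_clean[keep_rank:] *= tail_weight
    clean = (U * S_clean) @ Vt

    # Isolated tail component, kept for later reuse as a negative signal.
    S_tail = S.copy()
    S_tail[:keep_rank] = 0.0
    tail = (U * S_tail) @ Vt
    return clean, tail

rng = np.random.default_rng(0)
style_embedding = rng.standard_normal((16, 32))  # stand-in for a real style embedding
clean_early, tail = suppress_tail(style_embedding, t=0, num_steps=50)
clean_late, _ = suppress_tail(style_embedding, t=49, num_steps=50)
```

Under this assumed schedule, `clean_late` equals the original embedding, while `clean_early` keeps the leading singular components and strongly damps the rest.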

Paper Structure (22 sections, 6 equations, 22 figures, 5 tables)

Figures (22)

  • Figure 1: $\texttt{CleanStyle}$ improves text-aligned style transfer by effectively mitigating content leakage. Compared to InstantStyle, our results better preserve prompt semantics while faithfully reflecting the reference style.
  • Figure 2: Overview of $\texttt{CleanStyle}$. We decompose cross-attention style embeddings via SVD into main and tail components, apply time-aware suppression to the tail component in CS-SVD, and form conditional embeddings. As the singular-value visualization shows (using the Key $K$ as an example), suppression is stronger at the earlier time step $t_{0}$ and weaker at the later time step, which preserves style details. SS-CFG uses the isolated tail component to build style-aware unconditional inputs. The figure shows the decomposition, the time-dependent filtering, and the conditional/unconditional pathways in sampling.
  • Figure 3: Motivational illustration. The baseline exhibits clear content leakage. Using only the tail component (as defined in Figure 2) further amplifies these artifacts, indicating that the tail region mainly encodes content-related signals rather than stylistic information. Conversely, relying solely on the main component weakens the overall style expression. These observations motivate our design: CS-SVD suppresses tail-induced content leakage, while the time-aware strategy modulates this suppression to avoid over-attenuating stylistic details, achieving a balanced and faithful stylization.
  • Figure 4: Qualitative comparison with the state-of-the-art encoder-based style transfer methods. Our approach effectively suppresses content leakage (row 1), achieves stronger prompt alignment (rows 2-4), and maintains higher visual fidelity with fewer structural or stylistic distortions (row 5).
  • Figure 5: Integration with StyleShot and DEADiff. In both comparisons, our method mitigates content leakage while preserving stylistic features.
  • ...and 17 more figures
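
The conditional/unconditional pathways mentioned in the Figure 2 caption can be sketched with the standard classifier-free guidance combination. This is a sketch under stated assumptions, not the paper's code: `ss_cfg_step` and `toy_denoiser` are hypothetical names, the denoiser is a toy stand-in for a diffusion model, and passing the tail component directly as the negative style condition is an assumption about how the "style-aware unconditional input" is built.

```python
import numpy as np

def ss_cfg_step(denoiser, latent, cond_embedding, tail_embedding, guidance_scale=5.0):
    """One guided noise prediction with a style-specific negative branch.

    Uses the usual CFG formula, but the negative branch is conditioned on
    the isolated tail component rather than a generic (e.g. zero) embedding.
    How the tail is fed to the model is an assumption made here.
    """
    eps_cond = denoiser(latent, cond_embedding)  # purified style condition
    eps_neg = denoiser(latent, tail_embedding)   # tail-based negative signal
    # Move the prediction away from the negative branch, toward the conditional one.
    return eps_neg + guidance_scale * (eps_cond - eps_neg)

def toy_denoiser(latent, style_embedding):
    """Hypothetical stand-in for a diffusion model's noise predictor."""
    return latent + style_embedding.mean()

latent = np.zeros(4)
cond = np.ones((2, 3))        # placeholder for a CS-SVD conditional embedding
tail = np.full((2, 3), 0.5)   # placeholder for the isolated tail component
guided = ss_cfg_step(toy_denoiser, latent, cond, tail, guidance_scale=3.0)
unguided = ss_cfg_step(toy_denoiser, latent, cond, tail, guidance_scale=1.0)
```

With `guidance_scale=1.0` the result reduces to the conditional prediction; larger scales push the output further from whatever the tail-conditioned branch predicts.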