StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer

Ruojun Xu, Weijie Xi, Xiaodi Wang, Yongbo Mao, Zach Cheng

TL;DR

StyleSSP tackles two persistent problems in training-free diffusion-based style transfer: unwanted changes to the layout of the content image and leakage of content from the style image. It improves the sampling startpoint with two components, Frequency Manipulation and Negative Guidance via Inversion, which improve content preservation and decouple style from content, with ControlNet and IP-Instruct providing targeted control and extraction. Experiments on MS-COCO and WikiArt against multiple baselines show improvements in ArtFID, FID, and LPIPS, along with strong qualitative results and ablations supporting both components. The approach is a practical, training-free solution that yields sharper content structures and more faithful style transfer, with potential extensions to region-aware startpoint strategies.

Abstract

Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, because they lack targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes to the original content and content leakage from the style image. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address both layout changes to the original content and content leakage from the style image. StyleSSP comprises two key components: (1) Frequency Manipulation: to improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of the content image; and (2) Negative Guidance via Inversion: to mitigate content leakage from the style image, we employ negative guidance in the inversion stage so that the startpoint of the sampling stage is distanced from the content of the style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving the original content and minimizing content leakage from the style image. Project page: https://github.com/bytedance/StyleSSP.
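
The second component, negative guidance during inversion, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a classifier-free-guidance style combination of two noise predictions and a standard deterministic DDIM update. The names `eps_model`, `cond`, `neg_cond`, `abars`, and `scale` are hypothetical; `neg_cond` stands in for whatever representation of the style image's content is used as the negative condition, which the abstract does not specify.

```python
import numpy as np

def guided_eps(eps_cond, eps_neg, scale):
    """Combine two noise predictions so the result moves away from eps_neg.

    Assumption: the usual classifier-free-guidance form, with the negative
    condition taking the place of the unconditional branch.
    """
    return eps_neg + scale * (eps_cond - eps_neg)

def ddim_step(z_t, eps, abar_t, abar_next):
    """One deterministic DDIM update between two noise levels.

    abar_* are cumulative alpha products; when inverting, abar_next < abar_t.
    """
    x0_pred = (z_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_next) * x0_pred + np.sqrt(1.0 - abar_next) * eps

def invert_with_negative_guidance(z0, eps_model, cond, neg_cond, abars, scale):
    """Run DDIM inversion from z0, applying negative guidance at every step.

    eps_model(z, abar, condition) is a hypothetical stand-in for a noise predictor.
    abars is a decreasing sequence of cumulative alpha products.
    """
    z = z0
    for abar_t, abar_next in zip(abars[:-1], abars[1:]):
        eps_c = eps_model(z, abar_t, cond)
        eps_n = eps_model(z, abar_t, neg_cond)
        z = ddim_step(z, guided_eps(eps_c, eps_n, scale), abar_t, abar_next)
    return z
```

In this sketch the guidance acts while the latent is being noised, so the resulting startpoint already differs depending on the negative condition, before any sampling step runs.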

Paper Structure (24 sections, 15 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Current problems in style transfer and our improvements. (a) The original content changes in previous work (right), even with ControlNet as an additional content controller. (b) Content leaks from the style image in previous work (right), where the river in the original image is covered by a lawn that should not be there. (c) Given a style image and a content image, StyleSSP synthesizes new images that achieve the best style transfer effect while preserving the details of the original content.
  • Figure 2: Overall Framework. (Left) Illustration of the proposed style transfer method. First, we invert the content image $I^c$ into the latent noise space as $z_T^c$. During this process, we use negative guidance (Sec. \ref{sec: NG}) to ensure that $z_T^c$ diverges from the content information of the style image. We then apply frequency manipulation (Sec. \ref{sec: FM}) to $z_T^c$, obtaining a low-frequency reduced latent $z_T^{c,\,'}$ as the startpoint for the sampling stage. During sampling, we follow InstantStyle's approach by injecting style features exclusively into the style-specific block and utilizing the ControlNet model to further preserve original content. (Right) Detailed explanation of frequency manipulation. We reduce the low-frequency components by a factor $\alpha$, while adding Gaussian noise proportional to $1 - \alpha$.
  • Figure 3: Reconstruction results with varying $\alpha$ values, demonstrating that high-frequency components play a critical role in the image layout, while low-frequency components contribute less to layout preservation.
  • Figure 4: Style transfer results with and without frequency manipulation, illustrating how frequency manipulation improves detail preservation. The result with frequency manipulation better preserves the text and lines in the background.
  • Figure 5: Style transfer results for negative guidance via inversion, negative guidance in the sampling step, and negative prompt guidance. The latter two both suffer from severe content leakage (the out-of-place grass on the river), while our method prevents it.
  • ...and 10 more figures
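
The frequency manipulation described in the Figure 2 caption (scale the low-frequency components by $\alpha$ and add Gaussian noise proportional to $1 - \alpha$) might look roughly like the NumPy sketch below. The caption fixes only the two weights; the circular low-pass mask, the `radius` cutoff, and the choice to inject the noise into the low-frequency band only are assumptions made here for illustration, and the function names are hypothetical.

```python
import numpy as np

def low_pass_mask(height, width, radius):
    """Boolean mask over a centered spectrum: True within `radius` of the DC term."""
    ys = np.arange(height) - height // 2
    xs = np.arange(width) - width // 2
    dist = np.sqrt(ys[:, None] ** 2 + xs[None, :] ** 2)
    return dist <= radius

def reduce_low_frequency(latent, alpha, radius, rng):
    """Scale the low-frequency band of `latent` by alpha and mix in (1 - alpha) noise.

    latent: real array of shape (channels, height, width).
    radius: assumed cutoff; keep it below min(height, width) / 2.
    """
    _, h, w = latent.shape
    mask = low_pass_mask(h, w, radius)
    noise = rng.standard_normal(latent.shape)

    # Centered 2D spectra over the spatial axes.
    lat_f = np.fft.fftshift(np.fft.fft2(latent, axes=(-2, -1)), axes=(-2, -1))
    noise_f = np.fft.fftshift(np.fft.fft2(noise, axes=(-2, -1)), axes=(-2, -1))

    # Inside the mask: alpha * latent + (1 - alpha) * noise. Outside: latent untouched.
    mixed = np.where(mask, alpha * lat_f + (1.0 - alpha) * noise_f, lat_f)

    out = np.fft.ifft2(np.fft.ifftshift(mixed, axes=(-2, -1)), axes=(-2, -1))
    return out.real
```

With `alpha = 1.0` the latent is returned unchanged; smaller values replace more of the low-frequency band with noise while the high-frequency band, which Figure 3 associates with image layout, is left as it was.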