Style-Friendly SNR Sampler for Style-Driven Generation

Jooyoung Choi, Chaehun Shin, Yeongtak Oh, Heeseung Kim, Jungbeom Lee, Sungroh Yoon

TL;DR

The paper addresses the challenge of learning novel personalized styles with diffusion models, which standard fine-tuning often fails to capture because style cues emerge at higher noise levels. It introduces the Style-friendly SNR sampler, which biases the log-SNR distribution toward high-noise regimes (e.g., $\lambda_t \sim \mathcal{N}(-6, \sigma^2)$) and maps to timesteps via $t = 1/(1+\exp(\lambda_t/2))$, paired with trainable LoRA adapters on MM-DiT to enable efficient style adaptation. Empirical results show improved style alignment across diverse reference styles and prompts, with qualitative and quantitative gains over baselines like SD3 and DCO, and demonstrated applications in multi-panel comics and typography. The approach offers a practical pathway to create and share new style templates for personalized content creation while highlighting the importance of training emphasis on high-noise levels for effective style learning.
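The sampling step described in the TL;DR can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the default `sigma=2.0` is an assumed value, since the summary only states $\mathcal{N}(-6, \sigma^2)$. Only the mean of -6 and the mapping $t = 1/(1+\exp(\lambda_t/2))$ come from the text.

```python
import math
import random


def log_snr_to_timestep(log_snr: float) -> float:
    """Map a log-SNR value to a timestep t in (0, 1) via t = 1 / (1 + exp(log_snr / 2))."""
    return 1.0 / (1.0 + math.exp(log_snr / 2.0))


def sample_style_friendly_timesteps(n, mu=-6.0, sigma=2.0, seed=None):
    """Draw n timesteps by sampling log-SNR ~ N(mu, sigma^2) and mapping to t.

    mu=-6 follows the example in the text; sigma=2.0 is an assumption
    made for illustration only.
    """
    rng = random.Random(seed)
    return [log_snr_to_timestep(rng.gauss(mu, sigma)) for _ in range(n)]


timesteps = sample_style_friendly_timesteps(10_000, seed=0)
mean_t = sum(timesteps) / len(timesteps)
print(f"mean timestep: {mean_t:.3f}")
```

Under this mapping a log-SNR of 0 corresponds to $t = 0.5$, and lower log-SNR values push $t$ toward 1 (the high-noise end), so the sampled timesteps cluster near the noisy side of the schedule. In a fine-tuning loop, these values would replace the timesteps drawn from the pre-training noise-level distribution.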

Abstract

Recent text-to-image diffusion models generate high-quality images but struggle to learn new, personalized styles, which limits the creation of unique style templates. In style-driven generation, users typically supply reference images exemplifying the desired style, together with text prompts that specify desired stylistic attributes. Previous approaches commonly rely on fine-tuning, yet they often reuse the objectives and noise-level distributions from pre-training without adaptation. We discover that stylistic features predominantly emerge at higher noise levels, leading current fine-tuning methods to exhibit suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enhances models' ability to capture novel styles indicated by reference images and text prompts. We demonstrate improved generation of novel styles that cannot be adequately described solely with a text prompt, enabling the creation of new style templates for personalized content creation.

Paper Structure

This paper contains 48 sections, 7 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 1: Fine-tuning text-to-image diffusion models on the style-friendly noise levels enables learning novel styles from reference images and text prompts. We present 'A kangaroo holding a beer, wearing ski goggles and passionately singing silly songs' in various styles including watercolor painting, flat illustration, and 3d rendering styles. References are shown in the red insert box.
  • Figure 2: Probability distribution of Log-SNR. We bias the distribution towards the shaded region where style features emerge.
  • Figure 3: Fine-tuning capability. (a) While FLUX succeeds in learning objects, (b) it struggles to capture styles, demonstrating that learning novel objects and styles requires distinct strategies. (c) We enable FLUX to learn styles. References are shown in the red insert box.
  • Figure 4: Prompt switching during generation. $\lambda_t$ indicates log-SNR. The bar graphs above each image represent the denoising steps, illustrating when each prompt is applied and at what point the prompt switch occurs. The style prompts are 'minimalist flat round logo', 'sticker', 'detailed pen and ink drawing', and 'cartoon'. Styles emerge in the initial 10% of denoising steps; therefore, (c) and (f) fail to capture target styles. In contrast, omitting style prompts in later steps (d, e) still preserves styles well, similar to the fully styled baseline (a). (g) and (h) quantify these observations, showing the average CLIP similarity across 5 prompts and 5 styles when omitting (g) or including (h) the style prompt in earlier steps. Here, we use FLUX with 28 inference steps.
  • Figure 5: Effect of varying $\bm{\mu}$ and $\bm{\sigma}$. Diffusion models start to capture the reference glowing style when $\mu$ is lower and $\sigma$ is larger. The prompt is 'a hybrid creature that is a mix of a waffle and a hippopotamus, in glowing style'. Samples are generated with the same seed.
  • ...and 14 more figures
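
The captions of Figures 2 and 5 describe biasing the log-SNR distribution toward a high-noise region by changing $\mu$ and $\sigma$. A rough way to see the effect is to compute how much probability mass a normal log-SNR distribution places below a cutoff. The sketch below is illustrative only: the cutoff `threshold=-4.0` and the grid of $\mu$ and $\sigma$ values are hypothetical choices, not numbers taken from the paper.

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def high_noise_fraction(mu: float, sigma: float, threshold: float = -4.0) -> float:
    """P(log-SNR < threshold) for log-SNR ~ N(mu, sigma^2).

    threshold is a hypothetical boundary for the high-noise region,
    chosen only for illustration.
    """
    return normal_cdf(threshold, mu, sigma)


# Hypothetical grid of means and standard deviations to compare.
grid = {
    (mu, sigma): high_noise_fraction(mu, sigma)
    for mu in (0.0, -3.0, -6.0)
    for sigma in (1.0, 2.0, 3.0)
}

for (mu, sigma), frac in grid.items():
    print(f"mu={mu:5.1f} sigma={sigma:3.1f} -> fraction below threshold: {frac:.4f}")
```

For a fixed $\sigma$, lowering $\mu$ always increases the fraction of training samples that fall below the cutoff. The effect of $\sigma$ depends on where $\mu$ sits relative to the cutoff, so the printed table is best read row by row rather than as a single trend.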