Style-Friendly SNR Sampler for Style-Driven Generation
Jooyoung Choi, Chaehun Shin, Yeongtak Oh, Heeseung Kim, Jungbeom Lee, Sungroh Yoon
TL;DR
The paper addresses the challenge of teaching diffusion models novel, personalized styles, which standard fine-tuning often fails to capture because stylistic features emerge predominantly at high noise levels. It introduces the Style-friendly SNR sampler, which biases the log-SNR distribution used during fine-tuning toward high-noise regimes (e.g., $\lambda_t \sim \mathcal{N}(-6, \sigma^2)$) and maps each sampled log-SNR to a timestep via $t = 1/(1+\exp(\lambda_t/2))$. The sampler is paired with trainable LoRA adapters on MM-DiT for efficient style adaptation. Empirical results show improved style alignment across diverse reference styles and prompts, with qualitative and quantitative gains over baselines such as SD3 and DCO, along with applications to multi-panel comics and typography. The approach offers a practical way to create and share new style templates for personalized content creation, and it highlights the importance of emphasizing high noise levels during training for effective style learning.
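The sampling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the rectified-flow convention $x_t = (1-t)x_0 + t\epsilon$ (so $t = 1$ is pure noise and $\lambda_t = 2\log\frac{1-t}{t}$), and the function names `sample_style_friendly_timesteps` and `log_snr_from_t` as well as the default `std=2.0` are hypothetical, since the TL;DR leaves $\sigma$ unspecified.

```python
import math
import random


def sample_style_friendly_timesteps(n, mean=-6.0, std=2.0, seed=None):
    """Sample n training timesteps by drawing log-SNR values and mapping them to t.

    Assumes the rectified-flow convention x_t = (1 - t) * x_0 + t * noise,
    where t = 1 is pure noise and lambda_t = 2 * log((1 - t) / t).
    The default std is a hypothetical choice; the TL;DR leaves sigma unspecified.
    """
    rng = random.Random(seed)
    timesteps = []
    for _ in range(n):
        log_snr = rng.gauss(mean, std)              # lambda_t ~ N(mean, std^2)
        t = 1.0 / (1.0 + math.exp(log_snr / 2.0))  # t = 1 / (1 + exp(lambda_t / 2))
        timesteps.append(t)
    return timesteps


def log_snr_from_t(t):
    """Inverse mapping: lambda_t = 2 * log((1 - t) / t), valid for 0 < t < 1."""
    return 2.0 * math.log((1.0 - t) / t)


if __name__ == "__main__":
    style = sample_style_friendly_timesteps(10_000, seed=0)
    centered = sample_style_friendly_timesteps(10_000, mean=0.0, seed=0)
    print(f"mean t, log-SNR mean -6: {sum(style) / len(style):.3f}")
    print(f"mean t, log-SNR mean  0: {sum(centered) / len(centered):.3f}")
    print(f"t at lambda_t = -6: {1.0 / (1.0 + math.exp(-3.0)):.3f}")  # prints 0.953
```

In this sketch the only knob that moves probability mass toward high noise is `mean`: with a log-SNR mean of $-6$, sampled timesteps cluster near $t \approx 0.95$, while a mean of $0$ centers them around $t = 0.5$.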
Abstract
Recent text-to-image diffusion models generate high-quality images but struggle to learn new, personalized styles, which limits the creation of unique style templates. In style-driven generation, users typically supply reference images exemplifying the desired style, together with text prompts that specify stylistic attributes. Previous approaches commonly rely on fine-tuning, yet they often reuse the objectives and noise level distributions from pre-training without adaptation. We discover that stylistic features predominantly emerge at higher noise levels, which causes current fine-tuning methods to exhibit suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning, concentrating training on the noise levels where stylistic features emerge. This enhances models' ability to capture novel styles indicated by reference images and text prompts. We demonstrate improved generation of novel styles that cannot be adequately described by a text prompt alone, enabling the creation of new style templates for personalized content creation.
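To make the shift toward higher noise levels concrete, the log-SNR sampler $\lambda_t \sim \mathcal{N}(-6, \sigma^2)$ can be rewritten as a distribution over timesteps. A short derivation, assuming the rectified-flow convention in which $t = 1$ is pure noise and $\lambda_t = 2\log\frac{1-t}{t}$ (the convention under which the mapping $t = 1/(1+\exp(\lambda_t/2))$ holds):

$$
\lambda_t = 2\log\frac{1-t}{t}
\;\Longleftrightarrow\;
t = \frac{1}{1+e^{\lambda_t/2}} = \operatorname{sigmoid}\!\left(-\tfrac{\lambda_t}{2}\right),
$$

$$
\lambda_t \sim \mathcal{N}(-6, \sigma^2)
\;\Longrightarrow\;
\operatorname{logit}(t) = -\tfrac{\lambda_t}{2} \sim \mathcal{N}\!\left(3, \tfrac{\sigma^2}{4}\right).
$$

Under this assumption the sampler is a logit-normal distribution over $t$ whose median is $t = 1/(1+e^{-3}) \approx 0.95$, whereas a logit-normal centered at $\operatorname{logit}(t) = 0$ has median $t = 0.5$ and places most of its mass at intermediate noise levels.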
