StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements
Mingkun Lei, Xue Song, Beier Zhu, Hao Wang, Chi Zhang
TL;DR
The paper tackles three problems in text-driven style transfer: overfitting to the reference style, misalignment with the text prompt, and layout artifacts. It makes three contributions: a training-free cross-modal AdaIN that harmonizes text and style conditioning, a Style-based Classifier-Free Guidance (SCFG) that uses a negative style image to selectively emphasize target stylistic elements, and a teacher model that stabilizes layout during early denoising steps. Extensive evaluations show improved text alignment and style fidelity over state-of-the-art methods, and ablations confirm that the components are complementary. The approach is compatible with existing adapter-based frameworks and requires no fine-tuning, offering practical improvements for digital art, advertising, and game design, where precise prompt adherence and stable layouts matter.
Abstract
Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly overfitting to reference styles, limited stylistic control, and misalignment with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism for better integration of style and text features, enhancing alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during early generation stages to stabilize spatial layouts and mitigate artifacts. Our extensive evaluations demonstrate significant improvements in style transfer quality and alignment with textual prompts. Furthermore, our approach can be integrated into existing style transfer frameworks without fine-tuning.
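
The abstract names AdaIN and classifier-free guidance without giving formulas, so the following is only a minimal sketch of the two generic building blocks, not the paper's implementation. It assumes the standard AdaIN formula (one feature set is re-scaled to the mean and standard deviation of another) and the standard classifier-free guidance combination, with the prediction conditioned on the negative style image standing in for the usual unconditional branch. Which modality supplies the statistics in the cross-modal variant, and the function names `adain` and `style_guided_noise`, are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Standard AdaIN on one feature channel (lists of floats).

    Normalizes `content` to zero mean and unit std, then re-scales it
    to the mean and std of `style`. Assumption: text features play the
    role of `content` and style-image features the role of `style`.
    """
    c_mean, c_std = fmean(content), pstdev(content)
    s_mean, s_std = fmean(style), pstdev(style)
    return [s_std * (x - c_mean) / (c_std + eps) + s_mean for x in content]


def style_guided_noise(eps_pos, eps_neg, scale):
    """Standard classifier-free guidance combination, element-wise.

    eps_pos: noise prediction conditioned on the reference style image.
    eps_neg: noise prediction conditioned on the negative style image
             (assumed here to replace the unconditional prediction).
    scale:   guidance strength; 1.0 returns eps_pos unchanged.
    """
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]


if __name__ == "__main__":
    text_feat = [0.0, 1.0, 2.0, 3.0]       # toy values, not real features
    style_feat = [10.0, 12.0, 14.0, 16.0]  # toy values, not real features
    print(adain(text_feat, style_feat))
    print(style_guided_noise([1.0, 2.0], [0.0, 1.0], scale=3.0))
```

In this sketch a guidance scale above 1.0 pushes the prediction away from whatever the negative style image contributes, which is one plausible reading of how SCFG could suppress unwanted stylistic elements.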
