StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

Mingkun Lei, Xue Song, Beier Zhu, Hao Wang, Chi Zhang

TL;DR

The paper tackles text-driven style transfer by addressing style overfitting, misalignment with prompts, and layout artifacts. It introduces three core contributions: cross-modal AdaIN to harmonize text and style conditioning without training, Style-Based CFG to selectively emphasize target stylistic elements using a negative style image, and a Teacher Model to stabilize layout during early denoising steps. Extensive evaluations show improved text alignment and style fidelity over state-of-the-art methods, with ablations confirming the complementary benefits of each component. The approach is compatible with existing adapter-based frameworks and remains fine-tuning-free, offering practical improvements for digital art, advertising, and game design where precise prompt adherence and stable layouts matter.

Abstract

Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly with overfitting to reference styles, limiting stylistic control, and misaligning with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism for better integration of style and text features, enhancing alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during early generation stages to stabilize spatial layouts and mitigate artifacts. Our extensive evaluations demonstrate significant improvements in style transfer quality and alignment with textual prompts. Furthermore, our approach can be integrated into existing style transfer frameworks without fine-tuning.
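As a rough illustration of the first two strategies, the sketch below shows the standard AdaIN operation and the standard classifier-free guidance combination, adapted to the setting the abstract describes. It is not the authors' implementation. The function names `adain` and `style_cfg`, the choice to re-normalize style features with text-feature statistics, and the use of a negative-style prediction in place of the unconditional branch are all assumptions made for illustration; the abstract does not give the exact formulas. Plain Python lists stand in for the feature tensors a real pipeline would use.

```python
from statistics import fmean, pstdev


def adain(content, reference, eps=1e-5):
    """Standard channel-wise AdaIN.

    Each argument is a list of channels; each channel is a list of floats.
    Every channel of `content` is shifted and scaled so that its mean and
    standard deviation match those of the same channel in `reference`.
    """
    if len(content) != len(reference):
        raise ValueError("content and reference need the same channel count")
    out = []
    for c_chan, r_chan in zip(content, reference):
        c_mean, c_std = fmean(c_chan), pstdev(c_chan)
        r_mean, r_std = fmean(r_chan), pstdev(r_chan)
        out.append(
            [r_std * (v - c_mean) / (c_std + eps) + r_mean for v in c_chan]
        )
    return out


def style_cfg(eps_target, eps_negative, scale):
    """Guidance step in the usual CFG form: negative + scale * (target - negative).

    ASSUMPTION: `eps_negative` is the noise prediction conditioned on a
    negative style image and replaces the unconditional branch of ordinary
    classifier-free guidance.
    """
    return [n + scale * (t - n) for t, n in zip(eps_target, eps_negative)]


# Hypothetical toy features: one channel each.
style_feat = [[1.0, 2.0, 3.0, 4.0]]
text_feat = [[10.0, 20.0]]

# ASSUMPTION: style features take on the statistics of the text features.
aligned = adain(style_feat, text_feat)

# Push the prediction away from the negative-style branch.
guided = style_cfg([1.0, 2.0], [0.0, 1.0], scale=3.0)
```

With `scale=1.0` the guidance step returns the target prediction unchanged; larger values move the result further from whatever the negative style image contributes.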

Paper Structure

This paper contains 20 sections, 7 equations, 28 figures, 2 tables, 1 algorithm.

Figures (28)

  • Figure 1: Results of our text-driven style transfer model. Given a style reference image, our method effectively reduces style overfitting, generating images that faithfully align with the text prompt while maintaining consistent layout structure across varying styles.
  • Figure 2: Illustration of overfitting issues in text-to-image generation, where the model tends to follow dominant colors or patterns from the style image rather than aligning precisely with the text prompt. Each prompt follows the format "A <color> <object>." From top to bottom, the objects are: bear, apple, frog, and car.
  • Figure 3: Illustration of the checkerboard artifact encountered in the CSGO [xing2024csgo] method during inference. The leftmost column shows the results generated by SDXL [podell2023sdxl]. The prompts, from top to bottom, are "A red apple" and "A pink cup." All generated results use the same initial noise latent.
  • Figure 4: Illustration of our proposed Cross-Modal AdaIN, Teacher Model, and Style-Based CFG.
  • Figure 5: Visualization of the Cross-Attention Map for the word "apple" in the prompt "A red apple" during the generation process. When artifacts appear, the attention tends to scatter as well.
  • ...and 23 more figures