Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer
Yanqi Ge, Jiaqi Liu, Qingnan Fan, Xi Jiang, Ye Huang, Shuai Qin, Hong Gu, Wen Li, Lixin Duan
TL;DR
This work tackles text-driven style transfer in diffusion-based text-to-image (T2I) models, where prompt-level style injection often distorts content structure. It introduces Adaptive Style Incorporation (ASI), which combines Siamese Cross-Attention (SiCA), for extracting separate content and style features, with Adaptive Content-Style Blending (AdaBlending), a mask-guided, structure-aware fusion step, all without model tuning. The approach improves both structure preservation and stylization on real and generated images, as shown by qualitative and quantitative evaluations and extensive ablations. The method is aimed at professional editing, where it enables precise, locality-aware style transfer while maintaining semantic integrity; the authors note limitations from inversion and the added computation of covariance-based masking.
Abstract
In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is preserving structure consistently while still achieving effective style transfer. Previous approaches in this field directly concatenate the content and style prompts for prompt-level style injection, which leads to unavoidable structure distortions. We propose a novel solution to the text-driven style transfer task, namely Adaptive Style Incorporation (ASI), which achieves fine-grained feature-level style incorporation. It consists of Siamese Cross-Attention (SiCA), which decouples the single-track cross-attention into a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module, which couples the content and style information in a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.
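
The dual-track idea described above can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the function names, the single-head attention, the sharing of key/value projection weights between the two tracks, and the linear mask-weighted blend are all assumptions made for illustration. The abstract does not specify how the mask is computed, so a placeholder mask is used.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, text_emb, w_k, w_v):
    # Single-head cross-attention from image queries to text tokens.
    keys = text_emb @ w_k                      # (tokens, d)
    values = text_emb @ w_v                    # (tokens, d)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values   # (pixels, d)

def siamese_cross_attention(queries, content_emb, style_emb, w_k, w_v):
    # Assumption: both tracks reuse the same projection weights and the
    # same image queries; only the text embeddings differ.
    content_feat = cross_attention(queries, content_emb, w_k, w_v)
    style_feat = cross_attention(queries, style_emb, w_k, w_v)
    return content_feat, style_feat

def blend(content_feat, style_feat, mask):
    # Assumption: a per-pixel mask in [0, 1]; 1 keeps the content feature,
    # 0 takes the style feature. The paper's actual blending rule may differ.
    return mask * content_feat + (1.0 - mask) * style_feat

# Toy shapes, chosen arbitrarily for the demonstration.
rng = np.random.default_rng(0)
pixels, d, text_dim = 16, 8, 12
queries = rng.normal(size=(pixels, d))
content_emb = rng.normal(size=(5, text_dim))   # 5 content-prompt tokens
style_emb = rng.normal(size=(3, text_dim))     # 3 style-prompt tokens
w_k = rng.normal(size=(text_dim, d))
w_v = rng.normal(size=(text_dim, d))

content_feat, style_feat = siamese_cross_attention(
    queries, content_emb, style_emb, w_k, w_v
)
mask = rng.uniform(size=(pixels, 1))           # placeholder mask
blended = blend(content_feat, style_feat, mask)
```

The point of the sketch is only that content and style prompts are attended to separately and merged at the feature level, rather than being concatenated into one prompt before attention.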
