SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer
Chunnan Shang, Zhizhong Wang, Hongwei Wang, Xiangming Meng
TL;DR
This work tackles semantic-region inconsistencies in attention-based arbitrary style transfer by introducing SCSA, a plug-and-play mechanism with two components: Semantic Continuous Attention (SCA) and Semantic Sparse Attention (SSA). SCA attends to all continuous key points within the same semantic region to capture global region-level style, while SSA focuses on the most similar sparse key point to preserve region-specific textures, with semantic adaptive normalization aligning content features. Integrated into existing Attn-AST frameworks without retraining, SCSA yields more coherent semantic stylization and richer textures, as validated on CNN, Transformer, and Diffusion backbones and against multiple SOTA baselines. The results demonstrate improved SSL, CFSD, and FID, along with favorable user study outcomes, highlighting SCSA’s broad applicability for high-quality semantic style transfer.
Abstract
Attention-based arbitrary style transfer methods, whether CNN-based, Transformer-based, or Diffusion-based, have flourished and produce high-quality stylized images. However, they perform poorly when the content and style images share the same semantics: the style of a semantic region in the stylized image is inconsistent with that of the corresponding region in the style image. We argue that the root cause is their failure to consider the relationship between local regions and semantic regions. To address this issue, we propose a plug-and-play semantic continuous-sparse attention, dubbed SCSA, for arbitrary semantic style transfer, in which each query point considers certain key points in the corresponding semantic region. Specifically, semantic continuous attention ensures that each query point fully attends to all the continuous key points in the same semantic region, which reflect the overall style characteristics of that region; semantic sparse attention allows each query point to focus on the most similar sparse key point in the same semantic region, which exhibits the specific stylistic texture of that region. By combining the two modules, SCSA aligns the overall style of corresponding semantic regions while transferring their vivid textures. Qualitative and quantitative results demonstrate that SCSA enables attention-based arbitrary style transfer methods to produce high-quality semantic stylized images.
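The two attention modes described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes flattened feature points with one integer semantic label per point, dot-product similarity computed on the features themselves, and a weighted sum as the way the two outputs are combined. The function names, the label arrays `q_sem` and `k_sem`, and the weights `alpha` and `beta` are all hypothetical.

```python
import numpy as np


def _same_region_scores(q, k, q_sem, k_sem):
    """Dot-product scores, with keys outside the query's semantic region masked out."""
    same = q_sem[:, None] == k_sem[None, :]          # (Nq, Nk) boolean mask
    if not same.any(axis=-1).all():
        # Assumption: every query label also occurs among the keys.
        raise ValueError("a query has no key point in its semantic region")
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return np.where(same, scores, -np.inf)


def semantic_continuous_attention(q, k, v, q_sem, k_sem):
    """Each query attends (softmax) over ALL key points in its own region."""
    scores = _same_region_scores(q, k, q_sem, k_sem)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                               # masked keys get weight 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def semantic_sparse_attention(q, k, v, q_sem, k_sem):
    """Each query copies the value of its single most similar key in its own region."""
    scores = _same_region_scores(q, k, q_sem, k_sem)
    return v[scores.argmax(axis=-1)]


def scsa(q, k, v, q_sem, k_sem, alpha=0.5, beta=0.5):
    """Hypothetical combination: a weighted sum of the two outputs."""
    sca = semantic_continuous_attention(q, k, v, q_sem, k_sem)
    ssa = semantic_sparse_attention(q, k, v, q_sem, k_sem)
    return alpha * sca + beta * ssa
```

In this sketch the continuous branch averages style features over a whole region, which tends toward the region's overall style, while the sparse branch keeps one unblended style feature per query, which tends to preserve texture. The mask is what stops a query from borrowing style from a different semantic region.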
