SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer

Chunnan Shang, Zhizhong Wang, Hongwei Wang, Xiangming Meng

TL;DR

This work tackles semantic-region inconsistencies in attention-based arbitrary style transfer (Attn-AST) by introducing SCSA, a plug-and-play mechanism with two components: Semantic Continuous Attention (SCA) and Semantic Sparse Attention (SSA). SCA lets each query attend to all continuous key points within the same semantic region to capture region-level style, while SSA focuses on the single most similar key point in that region to preserve region-specific textures; semantic adaptive instance normalization (S-AdaIN) first aligns the content features region by region. Integrated into existing Attn-AST frameworks without retraining, SCSA yields more coherent semantic stylization and richer textures, as validated on CNN, Transformer, and Diffusion backbones and against multiple SOTA baselines. The results show improved SSL, CFSD, and FID, along with favorable user study outcomes, indicating that SCSA is broadly applicable to high-quality semantic style transfer.

Abstract

Attention-based arbitrary style transfer methods, including CNN-based, Transformer-based, and Diffusion-based ones, have flourished and produced high-quality stylized images. However, they perform poorly on content and style images that share the same semantics, i.e., the style of a semantic region in the generated stylized image is inconsistent with that of the corresponding region in the style image. We argue that the root cause lies in their failure to consider the relationship between local regions and semantic regions. To address this issue, we propose a plug-and-play semantic continuous-sparse attention, dubbed SCSA, for arbitrary semantic style transfer: each query point considers certain key points in the corresponding semantic region. Specifically, semantic continuous attention ensures each query point fully attends to all the continuous key points in the same semantic region that reflect the overall style characteristics of that region; semantic sparse attention allows each query point to focus on the most similar sparse key point in the same semantic region that exhibits the specific stylistic texture of that region. By combining the two modules, the resulting SCSA aligns the overall style of the corresponding semantic regions while transferring the vivid textures of these regions. Qualitative and quantitative results show that SCSA enables attention-based arbitrary style transfer methods to produce high-quality semantic stylized images.
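The two attention variants described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`semantic_mask`, `sca`, `ssa`), the flattened `(num_points, channels)` feature layout, and the integer label arrays are all assumptions made for illustration. The restriction to the same semantic region is realized here by setting cross-region logits to negative infinity before the softmax, which is one common way to mask attention. The sketch also assumes that every content label occurs at least once among the style labels.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_mask(content_labels, style_labels):
    # mask[i, j] is True when query i and key j carry the same semantic label.
    return content_labels[:, None] == style_labels[None, :]

def sca(q, k, v, mask):
    # Continuous variant: attend to ALL keys in the query's semantic region.
    logits = np.where(mask, q @ k.T, -np.inf)
    return softmax(logits) @ v

def ssa(q, k, v, mask):
    # Sparse variant: keep only the single most similar key in the region.
    logits = np.where(mask, q @ k.T, -np.inf)
    rows = np.arange(logits.shape[0])
    top = logits.argmax(axis=1)
    sparse = np.full_like(logits, -np.inf)
    sparse[rows, top] = logits[rows, top]
    return softmax(sparse) @ v

# Toy data (hypothetical): 2 query points, 3 key points, 2 semantic classes.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
k = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 2.0]])
v = np.array([[1.0], [2.0], [3.0]])
content_labels = np.array([0, 1])
style_labels = np.array([0, 0, 1])

mask = semantic_mask(content_labels, style_labels)
out_sca = sca(q, k, v, mask)  # query 0 blends the values of keys 0 and 1
out_ssa = ssa(q, k, v, mask)  # query 0 copies the value of key 0 only
```

In this toy case the sparse variant returns the value of the best-matching key unchanged, whereas the continuous variant returns a weighted average over the whole region. How the two outputs are weighted and combined in SCSA is not stated in this section, so the sketch leaves them separate.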

Paper Structure

This paper contains 24 sections, 38 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: Comparisons of the Attn-AST approaches (CNN-based SANet park2019arbitrary, Transformer-based StyTR$^2$ deng2022stytr2, and Diffusion-based StyleID chung2024style) without and with our SCSA. The stylized images generated by the Attn-AST approaches exhibit style discontinuity between adjacent areas within the same semantic region (e.g., the background in the 1st row), style inconsistency between corresponding semantic regions (e.g., the cloth color in the 3rd row), and fewer textures (e.g., the block textures in the 3rd row).
  • Figure 2: Comparison between Semantic Continuous-Sparse Attention (SCSA) and Universal Attention (UA). SCSA includes two parts: (a) Semantic Continuous Attention (SCA): The query point of the content semantic map features can match all continuous key points of the style semantic map features in the same semantic region. Therefore, SCA can fully account for the overall stylistic characteristics (e.g., color and texture) of regions with the same semantics; (b) Semantic Sparse Attention (SSA): The query point of the content image features can match the most similar sparse key point of the style image features in the same semantic region. Hence, SSA can intently concentrate on the specific stylistic texture of regions with the same semantics. In contrast, (c) Universal Attention (UA) park2019arbitrary: The query point of the content image features pays attention to all key points of the style image features. As a result, UA fails to accurately convey the intricate overall stylistic characteristics and specific textures of regions that share identical semantics.
  • Figure 3: Detailed Procedure of Semantic Continuous-Sparse Attention (SCSA). In (a), S-AdaIN denotes that AdaIN huang2017arbitrary is applied individually for each semantic region. It initializes $F_c$, by matching the feature statistics of corresponding semantic regions in $F_c$ and $F_s$, to ensure that $F_c$ has semantically aligned color style information to some extent. Subsequently, $F_c$ can provide a more accurate query $f_q(\bar{F}_c)$ and purer content structure features with less interference from the original color style. SCSA in (a) comprises two parts: Semantic Continuous Attention (SCA) and Semantic Sparse Attention (SSA). “G1” in SCA sets each product of two points from different semantic categories to negative infinity. As shown in (b), only the products of points with the same semantics are retained. Therefore, SCA can fully account for the overall stylistic characteristics of regions with the same semantics. “G2” in SSA retains only the maximum value among the products of a given query point with all key points in the same semantic region, while all other values are set to negative infinity, as shown in (c). Hence, SSA can intently concentrate on the specific stylistic texture of the most similar structure in the regions with the same semantics. After SCSA, the generated $F_{cs}$ from (a) not only reflects the overall stylistic characteristics of the semantic region corresponding to $F_s$ but also captures the specific textures of that semantic region. Meanwhile, $F_{cs}$ also includes the content structure of $F_c$.
  • Figure 4: The overall frameworks of Attn-AST approaches (CNN-based, Transformer-based, and Diffusion-based methods) with semantic continuous-sparse attention (SCSA). SCA denotes semantic continuous attention. SSA denotes semantic sparse attention. S-AdaIN is semantic adaptive instance normalization. UA denotes the universal attention used in the Attn-AST methods. The dashed line in (b) shows that the encoded content features $F_c$ are used only in the first application of the feature transformation module, and the output features $F_{cs}$ are used as the new content features in subsequent transformations. The dashed line in (c) indicates that S-AdaIN is used only when $t$ is the maximum time step $T$.
  • Figure 5: Qualitative comparisons among Attn-AST approaches, those with SCSA, and SOTA methods.
  • ...and 17 more figures
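
The Figure 3 caption describes S-AdaIN as AdaIN applied individually to each semantic region, matching the feature statistics of corresponding regions in $F_c$ and $F_s$. A minimal NumPy sketch of that idea follows. It is an illustration under stated assumptions, not the paper's code: the function name `s_adain`, the flattened `(num_points, channels)` layout, the `eps` stabilizer, and the choice to leave content regions with no matching style region unchanged are all hypothetical.

```python
import numpy as np

def s_adain(f_c, f_s, content_labels, style_labels, eps=1e-5):
    """Per-region AdaIN: give each content region the channel-wise mean and
    standard deviation of the style region that has the same label."""
    out = f_c.copy()
    for label in np.unique(content_labels):
        c_idx = content_labels == label
        s_idx = style_labels == label
        if not s_idx.any():
            continue  # assumption: unmatched regions are left as they are
        c_reg, s_reg = f_c[c_idx], f_s[s_idx]
        c_mean, c_std = c_reg.mean(axis=0), c_reg.std(axis=0) + eps
        s_mean, s_std = s_reg.mean(axis=0), s_reg.std(axis=0) + eps
        # Normalize the content region, then re-scale with style statistics.
        out[c_idx] = (c_reg - c_mean) / c_std * s_std + s_mean
    return out

# Toy data (hypothetical): 6 content points, 8 style points, 3 channels.
rng = np.random.default_rng(0)
f_c = rng.normal(size=(6, 3))
f_s = rng.normal(loc=5.0, size=(8, 3))
content_labels = np.array([0, 0, 0, 1, 1, 1])
style_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

f_c_aligned = s_adain(f_c, f_s, content_labels, style_labels)
```

After the call, each content region has the channel-wise mean of its matching style region, which is the sense in which the caption says the content features carry semantically aligned color style information before attention is applied.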