SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer

Hongda Liu, Longguang Wang, Ye Zhang, Ziru Yu, Yulan Guo

TL;DR

The paper addresses the trade-off between a global receptive field and computational efficiency in arbitrary image style transfer. It introduces SaMam, a style-aware Mamba framework consisting of a Mamba encoder for content and style and a style-aware Mamba decoder. The decoder is built from Style-aware Vision State Space Modules (SAVSSM), each centered on a style-conditioned S7 Block and a zigzag scan strategy. Key contributions include a pluggable, style-adaptive decoding module, a local enhancement step that mitigates local pixel forgetting, and several style-aware components (SConv, SCM, SAIN) that enable flexible style transfer with linear complexity. Experiments on MS-COCO and WikiArt show SaMam achieving a state-of-the-art balance between stylization quality and efficiency, supported by perceptual and structural metrics and by ablation studies that isolate each component's impact.

Abstract

A global effective receptive field plays a crucial role in obtaining high-quality stylized results in image style transfer (ST). However, existing ST backbones (e.g., CNNs and Transformers) incur high computational complexity to achieve global receptive fields. Recently, the State Space Model (SSM), especially its improved variant Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers an approach to resolve this dilemma. In this paper, we develop a Mamba-based style transfer framework, termed SaMam. Specifically, a Mamba encoder is designed to efficiently extract content and style information. In addition, a style-aware Mamba decoder is developed to flexibly adapt to various styles. Moreover, to address the problems of local pixel forgetting, channel redundancy, and spatial discontinuity in existing SSMs, we introduce both local enhancement and zigzag scan. Qualitative and quantitative results demonstrate that our SaMam outperforms state-of-the-art methods in terms of both accuracy and efficiency.
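
The spatial discontinuity mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simple serpentine traversal as a stand-in for the zigzag scan, whose exact pattern is not given in this summary, and the function names (`raster_order`, `zigzag_order`, `max_step`) are hypothetical. It compares how far apart consecutive tokens are when a 2D feature map is flattened into a 1D sequence for an SSM.

```python
def raster_order(h, w):
    """Row-major scan: every row is read left to right."""
    return [(r, c) for r in range(h) for c in range(w)]


def zigzag_order(h, w):
    """Assumed serpentine scan: even rows left to right, odd rows right to left."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def max_step(order):
    """Largest Manhattan distance between consecutive positions in a scan."""
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


if __name__ == "__main__":
    h, w = 3, 4
    # Raster scan jumps from the end of one row to the start of the next.
    print(max_step(raster_order(h, w)))  # 4
    # Serpentine scan always moves to a spatially adjacent pixel.
    print(max_step(zigzag_order(h, w)))  # 1
```

Under this assumption, neighbouring tokens in the flattened sequence are always neighbouring pixels, which is the property a scan designed to reduce spatial discontinuity aims to preserve.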

Paper Structure

This paper contains 19 sections, 9 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Trade-off between inference time $t$ (ms) and ArtFID (Wright et al., 2022) achieved by different methods. The size of each circle represents MACs (G).
  • Figure 2: (a) An overview of our SaMam framework. (b) An illustration of the selective scan methods in Vision Mamba (Zhu et al., 2024) and VMamba (Liu et al., 2024).
  • Figure 3: The detailed architecture of Style-aware Vision State Space Module (SAVSSM).
  • Figure 4: Comparison of different norm strategies.
  • Figure 5: Qualitative comparison with previous state-of-the-art methods.
  • ...and 5 more figures