Mamba-ST: State Space Model for Efficient Style Transfer

Filippo Botti, Alex Ergasti, Leonardo Rossi, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

TL;DR

The paper tackles the high computational cost of state-of-the-art style transfer methods based on transformers and diffusion by introducing Mamba-ST, a fully Mamba-based architecture that fuses content and style inside a vision State Space Model without AdaLN or cross-attention modules. It does so by adapting Mamba's inner equations into a cross-attention-like fusion block (ST-VSSM) that uses style-derived matrices and a content-derived projection. The model consists of two encoders and a dedicated decoder, and it randomly shuffles the style embeddings to isolate high-level style features. In experiments on COCO and WikiArt, the approach reports superior ArtFID and competitive FID, lower memory and time requirements than diffusion-based methods, and solid content preservation (LPIPS/CFSD). The work demonstrates a practical, efficient, AdaLN-free style transfer pipeline, suggesting that Vision SSMs may apply more broadly to image synthesis and to other cross-domain fusion problems.

Abstract

The goal of style transfer is, given a content image and a style source, to generate a new image that preserves the content while adopting the artistic representation of the style source. Most state-of-the-art architectures use transformers or diffusion-based models to perform this task, despite the heavy computational burden they impose. In particular, transformers use self- and cross-attention layers, which have a large memory footprint, while diffusion models require long inference times. To overcome these limitations, this paper explores a novel design of Mamba, an emergent State-Space Model (SSM), called Mamba-ST, to perform style transfer. To do so, we adapt Mamba's linear equations to simulate the behavior of cross-attention layers, which combine two separate embeddings into a single output, while drastically reducing memory usage and time complexity. We modify Mamba's inner equations so that they accept inputs from, and combine, two separate data streams. To the best of our knowledge, this is the first attempt to adapt the equations of SSMs to a vision task like style transfer without requiring any other module, such as cross-attention or custom normalization layers. An extensive set of experiments demonstrates the superiority and efficiency of our method in performing style transfer compared to transformers and diffusion models. Results show improved quality in terms of both ArtFID and FID metrics. Code is available at https://github.com/FilippoBotti/MambaST.
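
The two-stream idea in the abstract can be illustrated with a minimal NumPy sketch of a selective-scan recurrence in which the step size $\Delta$, the input matrix $B$, and the scanned input come from the style sequence, while the output matrix $C$ comes from the content sequence. This is not the authors' implementation: the function name `st_selective_scan`, the projection weights `W_B`, `W_C`, `W_delta`, the fixed negative diagonal `A`, and the simplified discretization ($\bar{A} = \exp(\Delta A)$, $\bar{B} = \Delta B$) are all assumptions made for illustration.

```python
import numpy as np

def softplus(x):
    # Keeps the step size delta strictly positive.
    return np.log1p(np.exp(x))

def st_selective_scan(style, content, A, W_B, W_C, W_delta):
    """Hypothetical two-stream selective scan.

    style, content: (L, D) token sequences.
    A: (D, N) negative state matrix (assumed fixed, diagonal per channel).
    W_B, W_C: (D, N) projections; W_delta: (D, D) projection.
    """
    L, D = style.shape
    N = A.shape[1]
    delta = softplus(style @ W_delta)   # (L, D), derived from style
    B = style @ W_B                     # (L, N), derived from style
    C = content @ W_C                   # (L, N), derived from content
    h = np.zeros((D, N))                # hidden state
    y = np.empty((L, D))
    for t in range(L):
        A_bar = np.exp(delta[t][:, None] * A)        # discretized A, in (0, 1)
        B_bar = delta[t][:, None] * B[t][None, :]    # simplified discretized B
        h = A_bar * h + B_bar * style[t][:, None]    # state driven by style
        y[t] = h @ C[t]                              # read out with content
    return y

rng = np.random.default_rng(0)
L, D, N = 6, 4, 3
style = rng.normal(size=(L, D))
content = rng.normal(size=(L, D))
A = -np.exp(rng.normal(size=(D, N)))
W_B, W_C = rng.normal(size=(D, N)), rng.normal(size=(D, N))
W_delta = rng.normal(size=(D, D))
out = st_selective_scan(style, content, A, W_B, W_C, W_delta)
print(out.shape)
```

In this sketch the content stream plays a role loosely analogous to a query in cross-attention (it selects what to read from the state), while the style stream fills the state. The loop costs time linear in the sequence length `L`, which is the property the abstract contrasts with attention.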

Paper Structure

This paper contains 22 sections, 21 equations, 8 figures, 2 tables, 2 algorithms.

Figures (8)

  • Figure 1: Examples of generated images from our Mamba model given a style and a content image
  • Figure 2: a) The full Mamba-ST architecture. It takes a content image and a style image as input and generates the content image rendered in the style of the style image. b) Mamba encoder derived from VMamba (Liu et al., 2024) with an additional skip connection (rightmost). c) Our Mamba-ST Decoder, which takes both style and content as input. In particular, style embeddings are shuffled before being passed to ST-VSSM in order to lose spatial information, retaining only higher-level information. d) The inner architecture of the Base VSSM. e) The inner architecture of the Base 2D-SSM. f) Our ST-VSSM. Notably, DWConv is shared between the content and style embeddings. g) Our modified ST 2D-SSM, where the matrices $A$, $B$ and $\Delta$ are computed from the style, the input of the selective scan is the style embedding, and the matrix $C$ is computed from the content.
  • Figure 3: The 2D selective scan with a $2\times 2$ example image.
  • Figure 4: Visual comparison with the current state-of-the-art models.
  • Figure 5: Zoomed-in results showing the patch problem. Gaps are visible between adjacent patches, and the model fails to apply the style uniformly.
  • ...and 3 more figures
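
The 2D selective scan of Figure 3 and the style shuffle mentioned in Figure 2c can be sketched as follows. The sketch assumes the scan follows the four-direction cross-scan of VMamba (row-major, column-major, and the reverse of each), which these captions do not spell out. The function names `cross_scan_orders` and `shuffle_style_tokens` are hypothetical.

```python
import numpy as np

def cross_scan_orders(h, w):
    """Return four 1D visiting orders over an h x w grid of patch indices.

    Assumed to match a VMamba-style cross-scan: row-major, column-major,
    and both orders reversed.
    """
    idx = np.arange(h * w).reshape(h, w)
    row_major = idx.reshape(-1)
    col_major = idx.T.reshape(-1)
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def shuffle_style_tokens(tokens, rng):
    """Randomly permute style tokens along the sequence axis.

    The permutation discards spatial layout while keeping every token,
    so only position-independent (higher-level) information remains.
    """
    perm = rng.permutation(tokens.shape[0])
    return tokens[perm]

# 2 x 2 example: patches are numbered 0 1 / 2 3.
orders = cross_scan_orders(2, 2)
for o in orders:
    print(o.tolist())

rng = np.random.default_rng(0)
style_tokens = np.arange(8, dtype=float).reshape(4, 2)  # 4 tokens, 2 channels
shuffled = shuffle_style_tokens(style_tokens, rng)
```

Each order gives the 1D sequence a separate scan would process. In a cross-scan design the per-direction outputs are mapped back to the grid and merged, a step omitted here.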