StyleMamba: State Space Model for Efficient Text-driven Image Style Transfer

Zijia Wang, Zhi-Song Liu

TL;DR

StyleMamba tackles the high computational cost of text-driven image style transfer by embedding a conditional State Space Model (Mamba) into a pretrained VAE-based AutoEncoder and using SigLIP as the style-text encoder. It introduces a masked directional loss and a second-order directional loss to align text prompts with image stylization while preserving content, enabling faster convergence and inference. Experimental results on COCO and WikiArt show superior alignment to text prompts, better content preservation, and competitive aesthetics compared with baselines, along with substantial speedups over prior methods. The approach generalizes to diverse applications, including multi-style transfers and design tasks, though it exhibits limitations with rare prompts and lacks segmentation-based control. Overall, StyleMamba provides a practical, efficient pipeline for text-guided styling with broad potential for creative and design workflows.
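
To make the term "State Space Model" concrete, the following is a minimal sketch of the discrete linear recurrence that such models are built on: a hidden state is updated once per sequence element and then read out. It assumes a scalar state with fixed, hand-picked parameters `a`, `b`, `c` (all hypothetical). Mamba makes these parameters depend on the input, and StyleMamba additionally conditions on the text prompt; neither is reproduced here, so this is not the authors' module.

```python
def ssm_scan(xs, a, b, c, h0=0.0):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    a, b, c are fixed scalars here (an assumption for illustration only).
    """
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x   # state update: decay old state, inject new input
        ys.append(c * h)    # readout from the hidden state
    return ys


# An impulse at the first step decays geometrically with rate a.
outputs = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
```

Each step costs a constant amount of work, so the scan is linear in sequence length, which is the general efficiency argument for this model family.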

Abstract

We present StyleMamba, an efficient image style transfer framework that translates text prompts into corresponding visual styles while preserving the content integrity of the original images. Existing text-guided stylization methods require hundreds of training iterations and substantial computing resources. To speed up the process, we propose a conditional State Space Model for Efficient Text-driven Image Style Transfer, dubbed StyleMamba, that sequentially aligns the image features to the target text prompts. To enhance the local and global style consistency between text and image, we propose masked and second-order directional losses that optimize the stylization direction, reducing training iterations by a factor of 5 and inference time by a factor of 3. Extensive experiments and qualitative evaluation confirm the robust and superior stylization performance of our method compared with existing baselines.
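
For orientation, the sketch below shows the plain directional loss commonly used in text-guided stylization: one minus the cosine similarity between the edit direction in image-embedding space and the edit direction in text-embedding space. The abstract does not define the masked or second-order variants, so they are not implemented here. The embeddings are stand-in vectors (assumed to come from a shared image-text encoder), and all function names are hypothetical.

```python
import math


def _sub(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def _cosine(a, b, eps=1e-8):
    """Cosine similarity; eps guards against division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def directional_loss(img_src, img_out, txt_src, txt_tgt):
    """1 - cos(image edit direction, text edit direction).

    Inputs are embedding vectors (placeholders in this sketch):
    content image, stylized image, source text, target style text.
    """
    d_img = _sub(img_out, img_src)   # how the image moved
    d_txt = _sub(txt_tgt, txt_src)   # how the prompt asks it to move
    return 1.0 - _cosine(d_img, d_txt)


# Toy 2-D embeddings: the image moved in the same direction as the text.
aligned = directional_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
# The image moved orthogonally to the text direction.
orthogonal = directional_loss([0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [2.0, 0.0])
```

The loss is near 0 when the two directions agree, 1 when they are orthogonal, and approaches 2 when they are opposed, so minimizing it pushes the stylized image along the direction described by the prompt.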

Paper Structure (10 sections, 6 equations, 7 figures, 6 tables)


Figures (7)

  • Figure 1: Comparative results showcasing the efficacy of the StyleMamba framework. a) highlights the rapid convergence and stylization capabilities with fewer epochs. b) and c) demonstrate the detailed stylization fidelity and the transferability of various styles, including painting and lighting styles, as well as some complicated styles such as environmental styles. Finally, d) shows the strong content preservation of StyleMamba, with competitive style transfer performance compared with other style transfer models. All results reflect the superiority of StyleMamba.
  • Figure 2: Workflow overview of the StyleMamba framework. The process begins with a content image and a style prompt (e.g., "Paul Gauguin style"). An encoder converts the content image into a latent representation, which undergoes style fusion with features derived from the style prompt. This fusion is facilitated by the Style Fusion Module, incorporating masked and second-order directional losses to guide the text-to-image stylization. The result is a stylized image $\textbf{Y}$, which closely adheres to the style prompt while preserving content integrity.
  • Figure 3: Illustration of the proposed second-order directional loss. It shows how $L_{so}$ allows for rapid adjustments in the direction of stylization. Notably, it facilitates refined stylistic shifts, ensuring a swift and coherent transition towards the desired visual style.
  • Figure 4: Qualitative comparison with SOTA algorithms. We show three cases of text-guided style transfer. For reference, we use the text to retrieve a reference image, which serves for style comparison and as the input to InstantStyle.
  • Figure 5: Speed comparison between StyleMamba and CLIPstyler. The graph depicts the style loss and corresponding intermediate results over time, illustrating the convergence efficiency of each method. StyleMamba's result at epoch 15 is more detailed than CLIPstyler's result at epoch 100.
  • ...and 2 more figures