StyleMamba: State Space Model for Efficient Text-driven Image Style Transfer
Zijia Wang, Zhi-Song Liu
TL;DR
StyleMamba tackles the high computational cost of text-driven image style transfer by embedding a conditional State Space Model (Mamba) into a pretrained VAE-based AutoEncoder and using SigLIP as the style-text encoder. It introduces a masked directional loss and a second-order directional loss to align text prompts with image stylization while preserving content, enabling faster convergence and inference. Experimental results on COCO and WikiArt show superior alignment to text prompts, better content preservation, and competitive aesthetics compared with baselines, along with substantial speedups over prior methods. The approach generalizes to diverse applications, including multi-style transfers and design tasks, though it exhibits limitations with rare prompts and lacks segmentation-based control. Overall, StyleMamba provides a practical, efficient pipeline for text-guided styling with broad potential for creative and design workflows.
Abstract
We present StyleMamba, an efficient image style transfer framework that translates text prompts into corresponding visual styles while preserving the content integrity of the original images. Existing text-guided stylization methods require hundreds of training iterations and substantial computing resources. To speed up the process, we propose a conditional State Space Model for Efficient Text-driven Image Style Transfer, dubbed StyleMamba, that sequentially aligns the image features to the target text prompts. To enhance the local and global style consistency between text and image, we propose masked and second-order directional losses that optimize the stylization direction, reducing the number of training iterations by a factor of 5 and the inference time by a factor of 3. Extensive experiments and qualitative evaluation confirm the robust and superior stylization performance of our method compared to existing baselines.
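To make the idea of a "stylization direction" concrete, the following is a minimal sketch of a generic directional loss of the kind commonly used in text-guided image editing: one minus the cosine similarity between the image-embedding shift and the text-embedding shift. It is an illustration under stated assumptions, not the authors' formulation. The masked and second-order variants are not reproduced because the abstract does not specify them. The function name `directional_loss` and the toy vectors are hypothetical. In practice the embeddings would come from an image/text encoder such as SigLIP.

```python
import math

def directional_loss(img_src, img_out, txt_src, txt_tgt, eps=1e-8):
    """Generic directional loss (illustrative, not the paper's exact loss).

    Returns 1 - cos(d_img, d_txt), where d_img is the shift from the content
    image embedding to the stylized image embedding and d_txt is the shift
    from the source text embedding to the target style text embedding.
    """
    d_img = [o - s for o, s in zip(img_out, img_src)]
    d_txt = [t - s for t, s in zip(txt_tgt, txt_src)]
    dot = sum(a * b for a, b in zip(d_img, d_txt))
    norm = math.sqrt(sum(a * a for a in d_img)) * math.sqrt(sum(b * b for b in d_txt))
    # eps guards against division by zero when either shift is the zero vector
    return 1.0 - dot / (norm + eps)

# Toy 2-D "embeddings" (assumed values, for illustration only)
aligned = directional_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
orthogonal = directional_loss([0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [2.0, 0.0])
opposite = directional_loss([0.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
```

The loss is close to 0 when the image moves in the same embedding direction as the text prompt, 1 when the two shifts are orthogonal, and close to 2 when they point in opposite directions, so minimizing it pushes the stylized output toward the direction described by the prompt.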
