StyleRWKV: High-Quality and High-Efficiency Style Transfer with RWKV-like Architecture
Miaomiao Dai, Qianyu Zhou, Lizhuang Ma
TL;DR
StyleRWKV addresses the efficiency of neural style transfer (NST) by adopting an RWKV-like architecture with linear time complexity $O(n)$ and reduced memory usage. It introduces three core components within its ST-RWKV blocks: Recurrent WKV attention (Re-WKV), which captures bidirectional global dependencies at linear cost; Deformable Shifting (Deform-Shifting), which enables adaptive local token interaction; and Skip Scanning (S-Scanning), which captures long-range context. The architecture is a 4-level hierarchical encoder–decoder that uses AdaIN and multi-scale features to combine content preservation with style transfer. Experiments on MS-COCO and WikiArt show better stylization quality (measured by LPIPS, FID, and ArtFID) together with lower inference time and memory usage than Transformer- and Mamba-based NST baselines, supporting both the effectiveness and the efficiency of the approach.
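The summary mentions AdaIN (adaptive instance normalization) without spelling it out. As a minimal sketch of the standard AdaIN operation, and not of how StyleRWKV wires it into its encoder–decoder, the following NumPy snippet re-normalizes content features so that each channel takes on the mean and standard deviation of the style features. The function name `adain`, the `(C, H, W)` layout, the `eps` value, and the random inputs are assumptions made for illustration.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Standard AdaIN on feature maps of shape (C, H, W).

    Each content channel is normalized to zero mean and unit variance,
    then rescaled and shifted to the statistics of the matching style channel.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical feature maps; spatial sizes may differ between content and style.
rng = np.random.default_rng(0)
content_feat = rng.normal(0.0, 1.0, size=(4, 8, 8))
style_feat = rng.normal(3.0, 2.0, size=(4, 6, 6))
stylized = adain(content_feat, style_feat)
```

The output keeps the spatial layout of the content features while its per-channel statistics follow the style features, which is the sense in which AdaIN balances content preservation against stylization.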
Abstract
Style transfer aims to generate a new image that preserves the content of a source image while adopting the artistic representation of a style source. Most existing methods are based on Transformers or diffusion models; however, they suffer from quadratic computational complexity and high inference time. RWKV, an emerging deep sequence model, has shown immense potential for long-context sequence modeling in NLP tasks. In this work, we present a novel framework, StyleRWKV, that achieves high-quality style transfer with limited memory usage and linear time complexity. Specifically, we propose a Recurrent WKV (Re-WKV) attention mechanism, which incorporates bidirectional attention to establish a global receptive field. Additionally, we develop a Deformable Shifting (Deform-Shifting) layer that introduces learnable offsets to the sampling grid of the convolution kernel, allowing tokens to shift flexibly and adaptively from the region of interest and thereby enhancing the model's ability to capture local dependencies. Finally, we propose a Skip Scanning (S-Scanning) method that effectively establishes global contextual dependencies. Extensive experiments, including qualitative and quantitative evaluations, demonstrate that our approach outperforms state-of-the-art methods in terms of stylization quality, model complexity, and inference efficiency.
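To make the contrast between quadratic and linear cost concrete, the sketch below computes a simplified bidirectional, exponentially decayed weighted average in two ways: with an explicit $n \times n$ weight matrix, and with one forward and one backward recurrent pass. This is an illustration of why WKV-style attention can reach a global receptive field in $O(n)$ time, not the paper's Re-WKV; the scalar decay `w`, the single-channel inputs, and both function names are assumptions, and details of real RWKV implementations such as per-channel parameters and numerical stabilization are omitted.

```python
import numpy as np

def decayed_attention_quadratic(k, v, w):
    """Reference version: builds all n*n pairwise weights exp(-w*|t-i| + k_i)."""
    n = len(k)
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    weights = np.exp(-w * dist + k[None, :])
    return (weights @ v) / weights.sum(axis=1)

def decayed_attention_linear(k, v, w):
    """Same result in O(n) using a forward and a backward running sum."""
    n = len(k)
    decay = np.exp(-w)
    ek = np.exp(k)  # assumes moderate k; a real kernel would stabilize this
    num_f, den_f = np.zeros(n), np.zeros(n)
    num_b, den_b = np.zeros(n), np.zeros(n)

    run_num = run_den = 0.0
    for t in range(n):  # accumulates tokens i <= t
        run_num = decay * run_num + ek[t] * v[t]
        run_den = decay * run_den + ek[t]
        num_f[t], den_f[t] = run_num, run_den

    run_num = run_den = 0.0
    for t in range(n - 1, -1, -1):  # accumulates tokens i >= t
        run_num = decay * run_num + ek[t] * v[t]
        run_den = decay * run_den + ek[t]
        num_b[t], den_b[t] = run_num, run_den

    # Token t appears in both passes, so subtract its contribution once.
    return (num_f + num_b - ek * v) / (den_f + den_b - ek)

rng = np.random.default_rng(0)
k = rng.normal(size=16)
v = rng.normal(size=16)
out_quadratic = decayed_attention_quadratic(k, v, w=0.5)
out_linear = decayed_attention_linear(k, v, w=0.5)
```

Both functions give every token a weighted view of all other tokens, but only the first materializes a weight for each pair; the second carries a constant-size running state per direction, which is what keeps time and memory linear in the number of tokens.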
