StyleRWKV: High-Quality and High-Efficiency Style Transfer with RWKV-like Architecture

Miaomiao Dai, Qianyu Zhou, Lizhuang Ma

TL;DR

StyleRWKV addresses the efficiency of neural style transfer (NST) by adopting an RWKV-like architecture with linear time complexity $O(n)$ and reduced memory usage. It introduces three core components within its ST-RWKV blocks: Recurrent WKV attention (Re-WKV), which models bidirectional global dependencies at linear cost; Deformable Shifting (Deform-Shifting), which enables adaptive local token interaction; and Skip Scanning (S-Scanning), which captures long-range context. The architecture is a 4-level hierarchical encoder–decoder that uses AdaIN and multi-scale features to balance content preservation and style transfer. Experiments on MS-COCO and WikiArt report better stylization quality (LPIPS, FID, ArtFID) and lower inference time and memory usage than Transformer- and Mamba-based NST baselines.
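The AdaIN step mentioned above can be illustrated with the standard adaptive instance normalization formula, $\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$, applied per channel. The following is a minimal sketch in plain Python, not the paper's implementation: the function name `adain`, the list-of-channels layout, and the `eps` handling are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Per-channel AdaIN on toy feature maps.

    content, style: lists of channels, each channel a flat list of floats
    (a hypothetical layout chosen for this sketch).
    """
    out = []
    for c_ch, s_ch in zip(content, style):
        c_mu, c_sigma = fmean(c_ch), pstdev(c_ch)
        s_mu, s_sigma = fmean(s_ch), pstdev(s_ch)
        # Normalize the content channel, then rescale and shift it
        # to match the style channel's statistics.
        scale = s_sigma / (c_sigma + eps)
        out.append([(v - c_mu) * scale + s_mu for v in c_ch])
    return out


content = [[1.0, 2.0, 3.0, 4.0]]
style = [[10.0, 20.0, 30.0, 40.0]]
stylized = adain(content, style)
```

After the call, each stylized channel carries the mean and (up to the small `eps` term) the standard deviation of the corresponding style channel, while the relative ordering of the content values is preserved.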

Abstract

Style transfer aims to generate a new image that preserves the content of one image while adopting the artistic representation of a style source. Most existing methods are based on Transformers or diffusion models; however, they suffer from quadratic computational complexity and high inference time. RWKV, an emerging deep sequence model, has shown immense potential for long-context sequence modeling in NLP tasks. In this work, we present StyleRWKV, a novel framework that achieves high-quality style transfer with limited memory usage and linear time complexity. Specifically, we propose a Recurrent WKV (Re-WKV) attention mechanism, which incorporates bidirectional attention to establish a global receptive field. Additionally, we develop a Deformable Shifting (Deform-Shifting) layer that introduces learnable offsets to the sampling grid of the convolution kernel, allowing tokens to shift flexibly and adaptively from the region of interest and thereby enhancing the model's ability to capture local dependencies. Finally, we propose a Skip Scanning (S-Scanning) method that effectively establishes global contextual dependencies. Extensive experiments, including qualitative and quantitative evaluations, demonstrate that our approach outperforms state-of-the-art methods in terms of stylization quality, model complexity, and inference efficiency.
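To make the contrast between quadratic and linear complexity concrete, the sketch below counts token interactions as image resolution grows. It is back-of-the-envelope arithmetic only: the patch size of 8 and the helper names are hypothetical, and the counts ignore constant factors and channel dimensions.

```python
def token_count(height, width, patch):
    """Number of patch tokens for an image (patch size is an assumption)."""
    return (height // patch) * (width // patch)


def pairwise_cost(n):
    """Interaction count for full self-attention: every token attends to every token."""
    return n * n


def linear_cost(n):
    """Interaction count for a linear-time scan: one update per token."""
    return n


rows = []
for side in (256, 512, 1024):
    n = token_count(side, side, patch=8)
    rows.append((side, n, pairwise_cost(n), linear_cost(n)))

for side, n, quad, lin in rows:
    print(f"{side}x{side}: tokens={n}, quadratic={quad}, linear={lin}")
```

Doubling the image side quadruples the token count, so the linear count grows by 4x while the quadratic count grows by 16x.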

Paper Structure (13 sections, 14 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Comparison of existing neural style transfer methods. (a) Traditional (CNN-based or attention-based) methods suffer from a limited receptive field or quadratic computational complexity, leading to unsatisfactory results. (b) Diffusion-based methods rely on multiple iterative denoising steps, making them significantly time-intensive. (c) Performance and efficiency comparison. Left: overall stylization performance achieved by different methods, with their relative parameter counts. Right: growth of FLOPs with sequence length.
  • Figure 2: (a) StyleRWKV architecture. (b) Our ST-RWKV block incorporates a Re-WKV attention mechanism to model global dependencies with linear complexity, while a Deform-Shifting layer captures the local context within the regions of interest (ROIs). (c) Top: Re-WKV applies Bi-WKV attention recurrently along the S-Scanning directions, effectively achieving a global receptive field. Bottom: The S-Scanning mechanism skips samples (with a step of 2), performs an intra-group traversal, and merges inter-group sequences, enabling long-range patch connections.
  • Figure 3: Illustrations of different token shifting mechanisms.
  • Figure 4: Qualitative comparison with state-of-the-art neural style transfer methods, including feedforward neural network (FFN)-based (AesPA-Net, CAST, EFDM, AdaAttN, AdaIN), diffusion-model-based (StyleID, DiffuseIT, InST), Transformer-based (StyTR$^2$), and Mamba-based (MambaST) methods.
  • Figure 5: Qualitative ablations on different numbers of recurrences.
  • ...and 2 more figures
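
The Figure 2 caption describes S-Scanning as skipping samples with a step of 2, traversing within each group, and then merging the group sequences. One possible reading of that description is sketched below in plain Python. This is an illustrative interpretation, not the authors' code: the function name `skip_scan_order`, the raster traversal inside each group, and the order in which groups are merged are all assumptions.

```python
def skip_scan_order(height, width, step=2):
    """Return flat patch indices in a skip-scanning order.

    Patches are grouped by (row offset, column offset) modulo `step`;
    each group is traversed in raster order, and the groups are then
    concatenated into a single sequence.
    """
    order = []
    for row_off in range(step):
        for col_off in range(step):
            # Intra-group traversal: visit every `step`-th patch.
            for r in range(row_off, height, step):
                for c in range(col_off, width, step):
                    order.append(r * width + c)
    return order


def inverse_order(order):
    """Permutation that restores the original raster layout after scanning."""
    inv = [0] * len(order)
    for position, index in enumerate(order):
        inv[index] = position
    return inv


order = skip_scan_order(4, 4)
restore = inverse_order(order)
```

On a 4x4 grid the first group is `[0, 2, 8, 10]`, so patches two steps apart become neighbors in the sequence, which is how this kind of ordering shortens the distance between far-apart patches.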