RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation

Juntao Jiang, Jiangning Zhang, Weixuan Liu, Muxuan Gao, Xiaobin Hu, Zhucun Xue, Yong Liu, Shuicheng Yan

TL;DR

RWKV-UNet addresses the challenge of capturing long-range dependencies in medical image segmentation without incurring the high cost of full self-attention. It integrates the Receptance Weighted Key Value (RWKV) mechanism into a U-Net via Global-Local Spatial Perception (GLSP) blocks and Cross-Channel Mix (CCM) skip connections, pairing a robust encoder with a large-kernel decoder. The approach includes pre-trained encoders and scalable variants (Enc-T/S/B; RWKV-UNet-S/T) to balance accuracy and efficiency, and it demonstrates state-of-the-art performance across 11 diverse medical imaging datasets. Although the method is highly effective for 2D segmentation, future work will extend it to 3D volumes and ultra-lightweight RWKV configurations for broader clinical applicability.

Abstract

In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies, while transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model's ability to capture long-range dependencies and improves contextual understanding, both of which are crucial for accurate medical image segmentation. We build a strong encoder from Global-Local Spatial Perception (GLSP) blocks that combine CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module that improves skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, the smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.

Paper Structure (33 sections, 10 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: Comparative analysis of CNN-, Transformer-, Mamba-, RWKV-, and hybrid-based segmentation models, highlighting their respective strengths and weaknesses.
  • Figure 2: Overall architecture of the proposed RWKV-UNet. (a) the encoder with four stages constructed by stacked LP blocks and stacked GLSP blocks, (b) Cross-Channel Mix (CCM) Module for multi-scale fusion, (c) the decoder with four stages, (d) the Local Perception (LP) block, (e) the RWKV-based Global-Local Spatial Perception (GLSP) block, (f) the decoder block constructed by a point-convolution layer and a $9\times9$ DW-Conv layer, with a convolution and an upsampling operation.
  • Figure 3: Visual comparison of the effective receptive fields of the last-layer output when different attention mechanisms are used in the GLSP module.
  • Figure 4: Performance of different methods on the Synapse multi-organ segmentation dataset. The average DSC (%) is plotted against FLOPs (G). The size of each circle represents the model's parameter count. RWKV-UNet achieves SOTA performance with balanced computation cost, while RWKV-UNet-S and RWKV-UNet-T also achieve remarkable results.
  • Figure 5: A qualitative comparison with previous SOTA methods on the Synapse dataset. The visual results demonstrate that our method achieves more accurate segmentation, especially in difficult tasks like pancreas segmentation.
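
Figure 4 reports the average DSC (%), the Dice similarity coefficient, which for a predicted mask A and a ground-truth mask B is 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that metric on toy binary masks, not the authors' evaluation code. The function names, the nested-list mask format, the empty-mask convention, and the plain averaging over masks are all assumptions made for illustration; this page does not state how the paper aggregates scores over organs or cases.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks.

    Masks are nested lists of 0/1 values with identical shapes
    (an illustrative format, not one taken from the paper).
    """
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    # |A ∩ B|: pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    # |A| + |B|: foreground pixel counts of each mask, summed.
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)

    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dsc_percent(preds, targets):
    """Average DSC over paired masks, expressed as a percentage."""
    scores = [dice_coefficient(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(scores) / len(scores)


# Toy example: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3)
```

In the toy example the two masks share two foreground pixels out of three each, so the score is 4/6. A multi-organ benchmark would typically compute one such score per class and then average, which `mean_dsc_percent` mimics in the simplest possible way.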