RSRWKV: A Linear-Complexity 2D Attention Mechanism for Efficient Remote Sensing Vision Task

Chunshan Li, Rong Wang, Xiaofei Yang, Dianhui Chu

TL;DR

RSRWKV introduces a novel 2D-WKV scanning mechanism that bridges sequential processing and 2D spatial reasoning while maintaining linear complexity, offering a scalable solution for high-resolution remote sensing analysis.

Abstract

High-resolution remote sensing analysis faces challenges in global context modeling due to scene complexity and scale diversity. While CNNs excel at local feature extraction via parameter sharing, their fixed receptive fields fundamentally restrict long-range dependency modeling. Vision Transformers (ViTs) effectively capture global semantic relationships through self-attention mechanisms but suffer from quadratic computational complexity relative to image resolution, creating critical efficiency bottlenecks for high-resolution imagery. The RWKV model's linear-complexity sequence modeling achieves breakthroughs in NLP but exhibits anisotropic limitations in vision tasks due to its 1D scanning mechanism. To address these challenges, we propose RSRWKV, featuring a novel 2D-WKV scanning mechanism that bridges sequential processing and 2D spatial reasoning while maintaining linear complexity. This enables isotropic context aggregation across multiple directions. The MVC-Shift module enhances multi-scale receptive field coverage, while the ECA module strengthens cross-channel feature interaction and semantic saliency modeling. Experimental results demonstrate RSRWKV's superior performance over CNN and Transformer baselines in classification, detection, and segmentation tasks on NWPU RESISC45, VHR-10.v2, and GLH-Water datasets, offering a scalable solution for high-resolution remote sensing analysis.

Paper Structure

This paper contains 37 sections, 14 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overall architecture of RSRWKV. (a) shows the overall backbone architecture. (b) shows the 2D-RWKV Block. (c) shows the Spatial Mix module. (d) shows the Channel Mix module. "MVC-Shift" denotes the multi-view context token shift. "2D-WKV" denotes the 2D-WKV attention mechanism. "ECA" denotes the ECA module [wang2020eca].
  • Figure 2: The 2D-WKV pipeline takes the input feature map (with a channel depth of C/4) and replicates it through a channel duplication operation to create four sets of independent C-channel features. The sets are then scanned and unfolded into sequences along the horizontal, vertical, and two diagonal directions. After the Bi-WKV computation on each scanned sequence, an inverse operation restores the features to their original spatial dimensions. Finally, the outputs from the four directions are concatenated along the channel dimension, forming a unified feature representation that encodes multi-directional spatial information.
  • Figure 3: Illustrations of different token shift mechanisms. Quad-Shift [duan2024vision] fuses the current token with four adjacent tokens by linear interpolation. Omni-Shift [yang2024restore] fuses the current token with tokens from all directions using convolutions with different kernel sizes. Our MVC-Shift fuses the current token with tokens from all directions using convolutions with different dilation rates.
  • Figure 4: Comparison of segmentation results on the GLH-Water dataset's test set. As illustrated, our model demonstrates commendable performance in both overall segmentation and detail within the GLH-Water dataset. All images have been cropped to $512\times512$ pixels for comparison purposes.
  • Figure 5: Comparison of Attention Heatmaps in 2D-WKV and Bi-WKV. (a) The 2D-WKV heatmap shows four-directional Bi-WKV feature processing (WKV1-4) followed by receptance modulation (RWKV1-4) and final output projection. Same-row elements share a unified color scale to highlight directional feature interactions. (b) The Bi-WKV visualization sequentially displays input image, WKV computation, receptance modulation (RWKV), and output layers. Both visualizations correspond to the output weights from the first RWKV block in their respective architectures.
  • ...and 2 more figures
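
The scan, restore, and concatenate steps described in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `scan_orders` and `multi_direction_scan` are hypothetical, the exact diagonal traversal order is an assumption, and the Bi-WKV computation is replaced by a placeholder `seq_fn` (identity by default). The sketch assumes an input of shape (C/4, H, W) that is duplicated once per direction, so the concatenated output has C channels.

```python
import numpy as np

def scan_orders(h, w):
    """Four permutations of the row-major flat indices of an h x w grid.

    The diagonal orderings are an assumed convention; the paper may use another.
    """
    idx = np.arange(h * w).reshape(h, w)
    rows, cols = np.indices((h, w))
    r, c = rows.ravel(), cols.ravel()
    horizontal = idx.ravel()          # row by row
    vertical = idx.T.ravel()          # column by column
    # np.lexsort uses the LAST key as the primary sort key.
    diagonal = np.lexsort((r, r + c))  # group cells by row + col
    anti_diag = np.lexsort((r, r - c)) # group cells by row - col
    return [horizontal, vertical, diagonal, anti_diag]

def multi_direction_scan(x, seq_fn=lambda s: s):
    """x: array of shape (C/4, H, W). Returns an array of shape (C, H, W).

    seq_fn stands in for the per-sequence Bi-WKV step (hypothetical placeholder);
    it receives and must return an array of shape (C/4, H*W).
    """
    c4, h, w = x.shape
    flat = x.reshape(c4, h * w)
    outputs = []
    for order in scan_orders(h, w):
        seq = seq_fn(flat[:, order])      # unfold along one direction, then process
        restored = np.empty_like(seq)
        restored[:, order] = seq          # inverse scan: back to row-major layout
        outputs.append(restored.reshape(c4, h, w))
    return np.concatenate(outputs, axis=0)  # fuse directions along channels

if __name__ == "__main__":
    x = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
    y = multi_direction_scan(x)
    print(y.shape)
```

With the identity placeholder, each of the four output channel groups reproduces the input exactly, which is a quick check that the inverse scan undoes the forward scan. Passing an order-dependent `seq_fn` (for example a cumulative sum along the sequence axis) makes the four groups differ, which is the point of scanning in several directions.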