Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

TL;DR

This work addresses the high computational cost of Vision Transformers for high-resolution images by introducing Vision-RWKV (VRWKV), a vision encoder with linear-complexity attention inspired by RWKV. It combines a patch-based architecture with a bidirectional, linear-time attention mechanism (Bi-WKV) and a quad-directional token shift (Q-Shift) to expand receptive fields while maintaining scalability and stability. Across ImageNet classification, COCO detection, and ADE20K segmentation, VRWKV matches or exceeds ViT performance at lower or comparable FLOPs and exhibits strong robustness to input resolution and large-scale pretraining. The results suggest VRWKV as an efficient, scalable alternative to ViT for diverse visual perception tasks, particularly at high resolutions, with code released for replication.
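To make the quad-directional token shift concrete, here is a minimal NumPy sketch of one way such a shift could work. The summary above only says the shift is quad-directional and widens the receptive field, so the details here are assumptions for illustration: the function name `q_shift`, the split into four equal channel groups, the direction assigned to each group, and the zero padding at the borders. It is not the released implementation.

```python
import numpy as np


def q_shift(x: np.ndarray) -> np.ndarray:
    """Shift four channel groups of an (H, W, C) feature map by one pixel each.

    Assumed layout (hypothetical, for illustration only):
      group 0 reads from the pixel above, group 1 from the pixel below,
      group 2 from the pixel to the left, group 3 from the pixel to the right.
    Border positions with no neighbour in that direction are left at zero.
    """
    h, w, c = x.shape
    g = c // 4  # channels per group; any remainder falls into the last group
    out = np.zeros_like(x)
    out[1:, :, 0:g] = x[:-1, :, 0:g]                   # from above
    out[:-1, :, g:2 * g] = x[1:, :, g:2 * g]           # from below
    out[:, 1:, 2 * g:3 * g] = x[:, :-1, 2 * g:3 * g]   # from the left
    out[:, :-1, 3 * g:] = x[:, 1:, 3 * g:]             # from the right
    return out
```

After the shift, each token carries a quarter of its channels from each of its four spatial neighbours, so a following per-token linear layer can mix information from a small 2D neighbourhood at no extra attention cost.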

Abstract

Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the NLP field with necessary modifications for vision tasks. Similar to the Vision Transformer (ViT), our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage lies in its reduced spatial aggregation complexity, which renders it exceptionally adept at processing high-resolution images seamlessly, eliminating the necessity for windowing operations. Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage processing high-resolution inputs. In dense prediction tasks, it outperforms window-based models, maintaining comparable speeds. These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks. Code is released at https://github.com/OpenGVLab/Vision-RWKV.

Paper Structure (22 sections, 12 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Performance and efficiency comparison of Vision-RWKV (VRWKV) and ViT. (a) Bounding box average precision ($\rm AP^b$) comparison of VRWKV and ViT [touvron2021deit] with window attention and global attention on the COCO [lin2014microsoft] dataset. (b) Inference speed comparison of VRWKV-T and ViT-T across input resolutions from 224 to 2048. (c) GPU memory comparison of VRWKV-T and ViT-T across input resolutions from 224 to 2048.
  • Figure 2: Overall architecture of VRWKV. (a) The VRWKV architecture includes $L$ identical VRWKV encoder layers, an average pooling layer, and a linear prediction head. (b) The details of the VRWKV encoder layer. Q-Shift denotes the quad-directional shift method tailored for vision tasks. The "Bi-WKV" module serves as a bidirectional RNN cell or a global attention mechanism.
  • Figure 3: Comparison of effective receptive field (ERF) and attention runtime. (a) ERF for ViT and VRWKV in different settings. "No Shift" means no shift is used in the spatial-mix and channel-mix modules. "RWKV Attn" means the original RWKV attention without our modifications. Our VRWKV with Q-Shift and Bi-WKV has a more comprehensive ERF than the other variants. (b) Attention runtime for inference (left) and forward $+$ backward (right), tested on an NVIDIA A100 GPU.
  • Figure 4: Performance of VRWKV and DeiT [touvron2021deit] on ImageNet-1K [deng2009imagenet]. All models are trained at a fixed resolution of $224 \times 224$ and evaluated at different resolutions. Our VRWKV is clearly robust to changes in resolution.
  • Figure 5: Inference time of attention mechanisms. Input resolutions are swept from 224 to 1024. All experiments are run on an NVIDIA A100 GPU.
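
The resolution ranges in the captions above show why linear-complexity attention matters. As a rough back-of-the-envelope sketch, the snippet below counts patch tokens at a given input resolution and compares how cost grows for a mechanism that scales linearly in the number of tokens with one that scales quadratically. The patch size of 16 and the helper names `num_tokens` and `scaling` are assumptions for illustration and are not stated in this summary. Constant factors are ignored, so the ratios indicate growth trends, not measured runtimes.

```python
def num_tokens(resolution: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (patch size is an assumption)."""
    side = resolution // patch_size
    return side * side


def scaling(resolution: int, base: int = 224, patch_size: int = 16):
    """Return (linear_growth, quadratic_growth) relative to the base resolution."""
    ratio = num_tokens(resolution, patch_size) / num_tokens(base, patch_size)
    return ratio, ratio ** 2


for res in (224, 1024, 2048):
    lin, quad = scaling(res)
    print(f"{res:>4}px: {num_tokens(res):>6} tokens, "
          f"linear x{lin:.1f}, quadratic x{quad:.1f}")
```

Under these assumptions, going from 224 to 2048 pixels raises the token count from 196 to 16384, roughly 84 times more work for a linear mechanism and roughly 7000 times more for a quadratic one.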