RMT: Retentive Networks Meet Vision Transformers

Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He

TL;DR

RMT addresses two limitations of Vision Transformers: self-attention has no explicit spatial prior, and its cost grows quadratically with the number of tokens. It extends RetNet's temporal decay to a two-dimensional spatial decay based on Manhattan distance, yielding Manhattan Self-Attention (MaSA), and pairs it with a decomposed attention form that keeps the spatial decay matrix intact while reaching linear complexity. Without extra training data, RMT reports 84.8% and 86.1% top-1 accuracy on ImageNet-1k, 54.5 box AP and 47.2 mask AP on COCO, and 52.8 mIoU on ADE20K. The code is open source.

Abstract

Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and bears a quadratic computational complexity, thereby constraining the applicability of ViT. To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes. Specifically, we extend RetNet's temporal decay mechanism to the spatial domain, and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally, we propose an attention decomposition form that adapts to the explicit spatial prior, aiming to reduce the computational burden of modeling global information without disrupting the spatial decay matrix. Based on the spatial decay matrix and the attention decomposition form, we can flexibly integrate explicit spatial prior into the vision backbone with linear complexity. Extensive experiments demonstrate that RMT exhibits exceptional performance across various vision tasks. Specifically, without extra training data, RMT achieves **84.8%** and **86.1%** top-1 accuracy on ImageNet-1k with **27M/4.5GFLOPs** and **96M/18.2GFLOPs**, respectively. For downstream tasks, RMT achieves **54.5** box AP and **47.2** mask AP on the COCO detection task, and **52.8** mIoU on the ADE20K semantic segmentation task. Code is available at https://github.com/qhfan/RMT
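
The spatial decay matrix described in the abstract can be illustrated with a small NumPy sketch. It assumes the decay takes the form `gamma ** distance`, by analogy with RetNet's temporal decay, with the distance being the Manhattan distance between token positions on the image grid. The function name `manhattan_decay_mask`, the row-major token ordering, and the value `gamma=0.9` are illustrative choices, not taken from the authors' code.

```python
import numpy as np

def manhattan_decay_mask(height, width, gamma):
    """Return an (H*W, H*W) matrix whose entry (n, m) is gamma ** d(n, m),
    where d is the Manhattan distance between tokens n and m on the grid.
    Assumption: tokens are flattened in row-major order."""
    # Token n sits at row ys[n], column xs[n].
    ys, xs = np.divmod(np.arange(height * width), width)
    # Pairwise Manhattan distance |y_n - y_m| + |x_n - x_m|.
    dist = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    return gamma ** dist

# Hypothetical example: a 3x4 token grid with decay rate 0.9.
mask = manhattan_decay_mask(height=3, width=4, gamma=0.9)
```

Each token keeps full weight on itself (distance 0) and progressively less on tokens farther away. One plausible way to apply such a mask, stated here as an assumption, is to multiply the attention weights elementwise by it before aggregating the values.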

Paper Structure (40 sections, 8 equations, 4 figures, 12 tables)

Figures (4)

  • Figure 1: FLOPs vs. top-1 accuracy on ImageNet-1K with $224\times 224$ resolution. "*" indicates a model trained with token labeling.
  • Figure 2: Comparison among different Self-Attention mechanisms. In MaSA, darker colors represent smaller spatial decay rates, while lighter colors represent larger ones. The spatial decay rates that change with distance provide the model with rich spatial priors.
  • Figure 3: Overall architecture of RMT.
  • Figure 4: Spatial decay matrix in the decomposed MaSA.
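
Why a decomposed form can preserve the spatial decay matrix (Figure 4) is easier to see in a short sketch. Assuming the decay is `gamma ** distance` with Manhattan distance, the two-dimensional decay factorizes into one decay along the vertical axis and one along the horizontal axis, because `gamma ** (|dy| + |dx|) == gamma ** |dy| * gamma ** |dx|`. The helper names `axis_decay` and `full_decay` are hypothetical, and the row-major token ordering is an assumption.

```python
import numpy as np

def axis_decay(length, gamma):
    """1D decay matrix with entries gamma ** |i - j| along one image axis."""
    idx = np.arange(length)
    return gamma ** np.abs(idx[:, None] - idx[None, :])

def full_decay(height, width, gamma):
    """2D Manhattan decay over all H*W tokens (row-major order assumed)."""
    ys, xs = np.divmod(np.arange(height * width), width)
    dist = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    return gamma ** dist

height, width, gamma = 3, 4, 0.9
decay_h = axis_decay(height, gamma)   # shape (H, H), vertical axis
decay_w = axis_decay(width, gamma)    # shape (W, W), horizontal axis

# With row-major flattening, the Kronecker product of the two per-axis
# matrices reproduces the full (H*W, H*W) Manhattan decay matrix.
recombined = np.kron(decay_h, decay_w)
factorizes = np.allclose(recombined, full_decay(height, width, gamma))
```

The two per-axis matrices hold `H*H + W*W` entries instead of `(H*W)**2`, which hints at how attention computed separately along each axis might retain the same spatial prior at lower cost. How the paper combines the two axes in practice is not described on this page.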