Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model

Haobo Yuan, Xiangtai Li, Lu Qi, Tao Zhang, Ming-Hsuan Yang, Shuicheng Yan, Chen Change Loy

TL;DR

This work designs a mixed backbone that combines convolution and RWKV operations and achieves the best results in both accuracy and efficiency. It also designs an efficient decoder that uses multiscale tokens to obtain high-quality masks.

Abstract

Transformer-based segmentation methods struggle with efficient inference on high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention because they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifically, we design a mixed backbone that combines convolution and RWKV operations, which achieves the best results in both accuracy and efficiency. In addition, we design an efficient decoder that uses the multiscale tokens to obtain high-quality masks. We denote our method as RWKV-SAM, a simple, effective, and fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and jointly train one efficient yet high-quality segmentation model using this benchmark. Based on the benchmark results, RWKV-SAM achieves outstanding efficiency and segmentation quality compared to transformers and other linear attention models. For example, compared with a transformer model of the same scale, RWKV-SAM is more than 2x faster and achieves better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models, with better classification and semantic segmentation results. Code and models will be publicly available.
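
To make the efficiency motivation concrete, the following is a rough back-of-the-envelope sketch, not the paper's measurement. It assumes a 16x16 patch embedding (an assumption; the actual downsampling factor of the RWKV-SAM backbone may differ) and compares how the number of token-pair interactions in full self-attention grows against the per-token work of a linear-time mixer such as RWKV or Mamba. The helper names are hypothetical, and the counts ignore constants and the hidden dimension, so they indicate scaling only, not latency.

```python
# Illustrative scaling arithmetic only; not a latency model.
# Assumption: 16x16 patches, as in common ViT-style backbones.

def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of tokens after splitting an image into patch x patch tiles."""
    return (height // patch) * (width // patch)

def attention_pairs(n_tokens: int) -> int:
    """Token-pair interactions in full self-attention: grows as N^2."""
    return n_tokens * n_tokens

def linear_steps(n_tokens: int) -> int:
    """Per-token updates in a linear-time mixer (RWKV/Mamba style): grows as N."""
    return n_tokens

if __name__ == "__main__":
    for side in (512, 1024, 2048):
        n = num_tokens(side, side)
        print(f"{side}x{side}: tokens={n}, "
              f"attention_pairs={attention_pairs(n)}, "
              f"linear_steps={linear_steps(n)}")
```

Under these assumptions, doubling the image side multiplies the token count by 4, the linear-time work by 4, and the self-attention pair count by 16, which is why the gap widens at high resolution.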

Paper Structure (14 sections, 9 equations, 8 figures, 7 tables)


Figures (8)

  • Figure 1: (Left) Comparison of SAM kirillov2023segment, EfficientSAM xiong2023efficientsam, and our RWKV-SAM. (Right) Comparison of FPS, parameter count, and high-quality segmentation performance for SAM kirillov2023segment, EfficientSAM xiong2023efficientsam, HQ-SAM sam_hq, and RWKV-SAM. The input image resolution is $1024 \times 1024$. FPS and parameter counts are reported for the backbone, with FPS measured on one NVIDIA A100 GPU.
  • Figure 2: (Left) Overview of our RWKV-SAM. RWKV-SAM contains an image encoder, a prompt encoder, and a mask decoder. (Right) The efficient segmentation backbone architecture. The first two stages use the MBConv blocks, and the third uses the VRWKV blocks.
  • Figure 3: Latency (log scaled) of the backbone with different input image resolutions.
  • Figure 4: Visualization comparison. Given the box prompt as input, we show the predicted segmentation masks. Our RWKV-SAM shows better segmentation performance, especially in terms of detail.
  • Figure 5: The training datasets of RWKV-SAM. (Left) EntitySeg qi2022fine dataset. (Middle) COCONut-B deng2024coconut dataset. (Right) DIS5K qin2022 dataset.
  • ...and 3 more figures