RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything

Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang

TL;DR

This work tackles real-time, multi-purpose segmentation by proposing RMP-SAM, a lightweight, single-model framework capable of image panoptic segmentation, video instance segmentation, and SAM-like interactive segmentation. It uses a shared decoder with pooling-based dynamic convolution and two asymmetric adapters to balance object- and prompt-driven queries, enabling prompt-driven decoding across tasks. The model is trained with a joint co-training strategy on COCO and YouTube-VIS, with CLIP-based labeling for unified taxonomy, and demonstrates state-of-the-art speed–accuracy trade-offs across multiple benchmarks while maintaining real-time performance. The results indicate strong practical potential for edge devices and real-time editing/tracking applications, and the work provides detailed ablations and guidance for extending unified real-time segmentation to additional datasets and tasks.
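The shared-decoder idea can be illustrated with a toy, pure-Python sketch. It assumes that each query predicts a mask by a dot product with the per-pixel features, and that a decoder stage then pools the features under that mask to refine the query. The update rule here is a plain interpolation toward the pooled feature, swapped in for the dynamic convolution, whose exact form this summary does not give. All function names (`predict_masks`, `mask_pool`, `refine_queries`) and the `step` parameter are hypothetical and not taken from the RMP-SAM code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_masks(queries, features):
    """Mask probabilities: one dot product per (query, pixel) pair.

    queries:  N vectors of length C
    features: P pixel vectors of length C (a flattened H*W feature map)
    """
    return [[sigmoid(sum(q_c * f_c for q_c, f_c in zip(q, f))) for f in features]
            for q in queries]

def mask_pool(masks, features, threshold=0.5):
    """Average the feature vectors of the pixels each query currently claims."""
    channels = len(features[0])
    pooled = []
    for mask in masks:
        selected = [f for m, f in zip(mask, features) if m > threshold]
        if not selected:
            pooled.append([0.0] * channels)  # empty mask: nothing to pool
            continue
        pooled.append([sum(f[c] for f in selected) / len(selected)
                       for c in range(channels)])
    return pooled

def refine_queries(queries, features, step=0.5):
    """One toy decoder stage: predict masks, pool features, move queries toward them.

    Assumption: linear interpolation stands in for the dynamic convolution.
    """
    masks = predict_masks(queries, features)
    pooled = mask_pool(masks, features)
    return [[q_c + step * (p_c - q_c) for q_c, p_c in zip(q, p)]
            for q, p in zip(queries, pooled)]

# Toy feature map: two pixels of "object A" followed by two of "object B".
features = [[2.0, -2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, 2.0]]
# The same stage handles object queries and prompt queries alike.
queries = [[1.0, 0.0], [0.0, 1.0]]

refined = refine_queries(queries, features)
masks = predict_masks(refined, features)
binary = [[int(m > 0.5) for m in mask] for mask in masks]
print(binary)  # [[1, 1, 0, 0], [0, 0, 1, 1]]
```

Because pooling needs only the current masks and the feature map, a stage of this kind avoids a full attention matrix between queries and pixels, which is one plausible reason such a design suits a real-time budget.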

Abstract

Recent segmentation methods, which adopt large-scale data training and transformer architectures, aim to create one foundation model that can perform multiple tasks. However, most of these methods rely on heavy encoder and decoder frameworks, which hinders their use in real-time scenarios. Existing work on real-time segmentation focuses primarily on semantic segmentation within specific environments, such as autonomous driving, and often overlooks how well these models generalize across diverse scenarios. To fill this gap, this work explores a novel setting called real-time multi-purpose segmentation. It contains three fundamental sub-tasks: interactive segmentation, panoptic segmentation, and video instance segmentation. Unlike previous methods, which use a specific design for each task, we aim to use a single end-to-end model to accomplish all these tasks in real time. To meet real-time requirements and balance multi-task learning, we present a novel dynamic convolution-based method, Real-Time Multi-Purpose SAM (RMP-SAM). It contains an efficient encoder and an efficient decoupled adapter to perform prompt-driven decoding. We also explore different training strategies and one new adapter design to further boost co-training performance. We benchmark several strong baselines by extending existing works to support our multi-purpose segmentation. Extensive experiments demonstrate that RMP-SAM is effective and generalizes well on the proposed benchmarks and on other specific semantic tasks. Our implementation of RMP-SAM achieves the best balance between accuracy and speed for these tasks. Our code and model are available at https://github.com/xushilin1/RAP-SAM/.

Paper Structure (15 sections, 5 equations, 10 figures, 14 tables)

Figures (10)

  • Figure 1: We present real-time multi-purpose segmentation to segment and recognize objects for image, video, and visual prompt inputs. In addition to benchmarking, we propose a simple yet effective baseline, named RMP-SAM, which achieves the best trade-off between performance and speed across three different tasks. A larger dot indicates more parameters.
  • Figure 2: RMP-SAM overview. Our method contains three visual inputs: image, video, and visual prompts. Utilizing positional encoding, we generate prompt queries from these visual prompts. The learnable object queries, prompt queries, and the feature map $F$ are directed to the multi-stage decoder. This process generates multi-stage predictions and refined queries. These refined queries engage in cross-attention with $F$, resulting in the final prediction.
  • Figure 3: Meta-architecture exploration. (a) Simple shared decoder design. (b) Decoupled decoder design with two heads. (c) Shared decoder with decoupled adapter. (d) Decoupled decoder with decoupled adapters. Best viewed in color and zoomed in.
  • Figure 4: Visualization results on the YouTube-VIS 2019 and COCO datasets. The first two rows each visualize five input frames, with the same instance shown in the same color. The third row shows interactive segmentation results with a single-point prompt (in green). The last row shows panoptic segmentation results.
  • Figure 5: The visualization results of SAM-like methods on COCO instance segmentation. We adopt the ViTDet with ViT-b as the detector. The larger dot means more parameters.
  • ...and 5 more figures
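
The query construction described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming point prompts and a sinusoidal positional encoding; the caption states only that positional encoding is used, so the encoding form, the `temperature` value, and the helper names (`sincos_encode`, `point_to_prompt_query`, `build_decoder_queries`) are all hypothetical choices made for this example.

```python
import math

def sincos_encode(value, dim, temperature=10000.0):
    """Sinusoidal encoding of one normalized coordinate into `dim` numbers (dim even)."""
    out = []
    for i in range(dim // 2):
        freq = temperature ** (-2.0 * i / dim)
        angle = 2.0 * math.pi * value * freq
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def point_to_prompt_query(x, y, width, height, channels):
    """Encode a pixel-space point prompt as a query vector of length `channels`.

    Assumes `channels` is divisible by 4: half for x, half for y,
    each half split into sine/cosine pairs.
    """
    half = channels // 2
    return sincos_encode(x / width, half) + sincos_encode(y / height, half)

def build_decoder_queries(object_queries, points, width, height):
    """Concatenate learnable object queries with prompt queries for the shared decoder."""
    channels = len(object_queries[0])
    prompt_queries = [point_to_prompt_query(x, y, width, height, channels)
                      for x, y in points]
    return object_queries + prompt_queries

# Three stand-in "learnable" object queries (zeros here) and two point prompts.
object_queries = [[0.0] * 8 for _ in range(3)]
points = [(32, 48), (100, 20)]
decoder_queries = build_decoder_queries(object_queries, points, width=128, height=96)
print(len(decoder_queries), len(decoder_queries[0]))  # 5 8
```

Both query types end up as vectors of the same width, which is what lets a single multi-stage decoder process them together with the feature map $F$.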