RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything
Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang
TL;DR
This work tackles real-time, multi-purpose segmentation by proposing RMP-SAM, a lightweight, single-model framework capable of image panoptic segmentation, video instance segmentation, and SAM-like interactive segmentation. It uses a shared decoder with pooling-based dynamic convolution and two asymmetric adapters to balance object- and prompt-driven queries, enabling prompt-driven decoding across tasks. The model is trained with a joint co-training strategy on COCO and YouTube-VIS, with CLIP-based labeling for unified taxonomy, and demonstrates state-of-the-art speed–accuracy trade-offs across multiple benchmarks while maintaining real-time performance. The results indicate strong practical potential for edge devices and real-time editing/tracking applications, and the work provides detailed ablations and guidance for extending unified real-time segmentation to additional datasets and tasks.
Abstract
Recent segmentation methods, which adopt large-scale data training and transformer architectures, aim to create one foundation model that can perform multiple tasks. However, most of these methods rely on heavy encoder-decoder frameworks, which hinders their use in real-time scenarios. Existing work on real-time segmentation primarily focuses on semantic segmentation within specific environments, such as autonomous driving, and often overlooks how well these models generalize across diverse scenarios. To fill this gap, this work explores a novel real-time segmentation setting called real-time multi-purpose segmentation. It contains three fundamental sub-tasks: interactive segmentation, panoptic segmentation, and video instance segmentation. Unlike previous methods, which use a specific design for each task, we aim to use a single end-to-end model to accomplish all of these tasks in real time. To meet real-time requirements and balance multi-task learning, we present a novel dynamic convolution-based method, Real-Time Multi-Purpose SAM (RMP-SAM). It contains an efficient encoder and an efficient decoupled adapter to perform prompt-driven decoding. We also explore different training strategies and a new adapter design to further boost co-training performance. We benchmark several strong baselines by extending existing works to support our multi-purpose segmentation setting. Extensive experiments demonstrate that RMP-SAM is effective and generalizes well on the proposed benchmarks and on other specific semantic tasks. Our implementation of RMP-SAM achieves the optimal balance between accuracy and speed for these tasks. Our code and model are available at https://github.com/xushilin1/RAP-SAM/.
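To make the idea of a pooling-based dynamic convolution decoder more concrete, the following is a minimal NumPy sketch of one query-refinement step. It is an illustration under stated assumptions, not the released RMP-SAM code: the function names (`mask_pool`, `gated_query_update`, `predict_masks`), the tensor shapes, the sigmoid gating form, and the random weights are all hypothetical, and normalization layers and the adapters are omitted. The queries here could stand for either object queries or prompt queries processed by a shared decoder.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_masks(queries, features):
    """Dot each query (N, C) with the feature map (C, H, W) -> mask logits (N, H, W)."""
    return np.einsum("nc,chw->nhw", queries, features)

def mask_pool(features, mask_logits, eps=1e-6):
    """Average the features under each soft mask: (C, H, W), (N, H, W) -> (N, C)."""
    weights = sigmoid(mask_logits)
    pooled = np.einsum("nhw,chw->nc", weights, features)
    return pooled / (weights.sum(axis=(1, 2))[:, None] + eps)

def gated_query_update(queries, pooled, w_x, w_q):
    """Assumed gating form: both gates are computed from the pooled features.

    new_query = gate_x * pooled + gate_q * old_query
    """
    gate_x = sigmoid(pooled @ w_x)  # how much pooled content to inject
    gate_q = sigmoid(pooled @ w_q)  # how much of the old query to keep
    return gate_x * pooled + gate_q * queries

# Toy sizes (hypothetical): 8 channels, a 4x5 feature map, 3 queries.
rng = np.random.default_rng(0)
C, H, W, N = 8, 4, 5, 3
features = rng.normal(size=(C, H, W))
queries = rng.normal(size=(N, C))
w_x = rng.normal(size=(C, C)) * 0.1
w_q = rng.normal(size=(C, C)) * 0.1

mask_logits = predict_masks(queries, features)    # initial masks from queries
pooled = mask_pool(features, mask_logits)         # per-query pooled features
new_queries = gated_query_update(queries, pooled, w_x, w_q)
new_masks = predict_masks(new_queries, features)  # refined masks
```

In this sketch each refinement step costs one pooling pass and two small matrix products per query, rather than attention over every pixel, which is the kind of saving that makes a pooling-based decoder attractive for real-time use.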
