MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation

Long Xu, Shanghong Li, Yongquan Chen, Jun Luo, Shiwu Lai

TL;DR

The paper tackles scale variation in interactive segmentation by introducing MST, an adaptive multi-scale token framework that selects informative tokens from $8\times8$ and $28\times28$ patches to guide the base $16\times16$ ViT tokens. A differentiable top-$k$ mechanism picks tokens by their similarity to a kernel derived from user clicks, and the selected tokens are fused with the base tokens via Cross Attention. A triplet-like contrastive loss $\mathcal{L}_{c}$ helps separate target tokens from background tokens. The authors report state-of-the-art results on standard benchmarks, good generalization to remote sensing data, a lower Number of Clicks ($NoC$), and strong mask-correction performance on DAVIS-585. They also state that code and an interactive demo will be released to support reproducibility.
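
The selection step described above can be illustrated with a small sketch. It is not the authors' implementation: it uses a hard (non-differentiable) top-$k$, assumes cosine similarity as the score, and assumes the kernel is the mean of the clicked tokens. The function names `click_kernel` and `select_top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def click_kernel(tokens, clicked_idx):
    """Average the clicked tokens into one kernel vector (assumed pooling)."""
    dim = len(tokens[0])
    return [
        sum(tokens[i][d] for i in clicked_idx) / len(clicked_idx)
        for d in range(dim)
    ]


def select_top_k(tokens, kernel, k):
    """Return the indices of the k tokens most similar to the kernel."""
    scores = [cosine(t, kernel) for t in tokens]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Four toy 2-D tokens; the user "clicked" on token 0.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
kernel = click_kernel(tokens, clicked_idx=[0])
selected = select_top_k(tokens, kernel, k=2)
print(selected)
```

Only the selected tokens would then be passed to the fusion step, which is where the computational saving of a top-$k$ scheme comes from: the cross-attention cost scales with $k$ rather than with the full number of multi-scale tokens.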

Abstract

Interactive segmentation has gained significant attention for its applications in human-computer interaction and data annotation. To address target scale variation in interactive segmentation, a novel multi-scale token adaptation algorithm is proposed. By performing top-k operations across multi-scale tokens, the computational complexity is greatly reduced while performance is maintained. To enhance the robustness of multi-scale token selection, we also propose a token learning algorithm based on contrastive loss, which effectively improves the performance of multi-scale token adaptation. Extensive benchmarking shows that the algorithm achieves state-of-the-art (SOTA) performance compared to current methods. An interactive demo and all code needed to reproduce the results will be released at https://github.com/hahamyt/mst.

Paper Structure (20 sections, 8 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: The differences between the proposed algorithm and the existing methods, including the processing of click points and the use of multi-scale features
  • Figure 2: Average IoU as a function of the number of clicks (averaged over 6 benchmarks), indicating that the proposed method reaches a higher IoU with fewer clicks
  • Figure 3: The overall framework of the proposed algorithm. The adaptive patch embedding module extracts multi-scale tokens adaptively. Base tokens are the tokens with patch size $16\times 16$; multi-scale tokens are the tokens with patch sizes $8\times 8$ and $28\times 28$. The MST block is the proposed multi-scale token fusion module; more details can be found in Fig. 4. For efficient training, we utilize a random selection module, which has no impact during inference. The triplet token loss module represents our proposed contrastive loss-based token learning algorithm; more details can be found in Fig. 5. A simple FPN is adopted in this paper, and segmentation is performed using a two-layer MLP module
  • Figure 4: The proposed multi-scale tokens interactive module. The ViT Block denotes the original vision transformer block. The Score Block is the proposed algorithm for selecting important tokens. The Cross Attention is a feature fusion algorithm; further details can be found in [fu2019dual]. The Scaled Cross Attention is an efficient cross-attention algorithm referenced in [ren2022shunted]. The Top-K operation is a PyTorch function that selects the largest $K$ values in a vector. The term clicked tokens refers to the local region that has been clicked
  • Figure 5: Triplet patch loss computation diagram. Red squares represent tokens that do not belong to the target, yellow squares represent tokens that belong to the target, and yellow asterisks represent user-clicked points
  • ...and 5 more figures
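
The triplet-style token loss illustrated in Figure 5 can be sketched as follows. This is a generic triplet margin loss, not the paper's exact $\mathcal{L}_{c}$, whose definition is not given in this summary. It assumes the clicked token acts as the anchor, target tokens as positives, and non-target tokens as negatives. The function name `triplet_token_loss` and the margin of 1.0 are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_token_loss(anchor, positives, negatives, margin=1.0):
    """Mean hinge loss over every (positive, negative) token pair.

    Each term is max(0, d(anchor, pos) - d(anchor, neg) + margin), so the
    loss is zero once every negative is at least `margin` farther from the
    anchor than every positive.
    """
    losses = [
        max(0.0, euclidean(anchor, p) - euclidean(anchor, n) + margin)
        for p in positives
        for n in negatives
    ]
    return sum(losses) / len(losses)


anchor = [1.0, 0.0]        # token at the clicked point (assumed anchor)
positives = [[1.0, 0.0]]   # token belonging to the target
far_negative = [[1.0, 3.0]]   # background token, well separated
near_negative = [[1.0, 0.5]]  # background token, too close to the anchor

loss_far = triplet_token_loss(anchor, positives, far_negative)
loss_near = triplet_token_loss(anchor, positives, near_negative)
print(loss_far, loss_near)
```

In this toy setup the well-separated background token contributes no loss, while the nearby one is penalized, which is the behaviour a contrastive token loss is meant to encourage.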