MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation
Long Xu, Shanghong Li, Yongquan Chen, Jun Luo, Shiwu Lai
TL;DR
The paper tackles scale variation in interactive segmentation by introducing MST, an adaptive multi-scale token framework that selects informative tokens from $8\times8$, $16\times16$, and $28\times28$ patches to guide the base ViT tokens. A differentiable top-$k$ mechanism picks tokens by their similarity to a kernel derived from user clicks, and the selected tokens are fused with the base tokens via Cross Attention. A triplet-like contrastive loss $\mathcal{L}_{c}$ improves the discrimination of target tokens against background tokens. The approach is reported to achieve state-of-the-art results on standard benchmarks and to generalize well to remote sensing data, with notable reductions in the Number of Clicks (NoC) and strong mask-correction performance on DAVIS-585. The authors state that code and an interactive demo will be released to support reproducibility.
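The selection and fusion steps summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses a hard (non-differentiable) top-$k$ in place of the differentiable one, cosine similarity as the similarity measure, and a single-head cross attention without learned projections. The function names, token counts, and feature dimension are all hypothetical.

```python
import numpy as np

def select_topk_tokens(tokens, kernel, k):
    """Keep the k tokens most similar to the click-derived kernel.

    Assumption: cosine similarity and a hard top-k stand in for the
    paper's differentiable top-k mechanism.
    """
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    q = kernel / (np.linalg.norm(kernel) + 1e-8)
    scores = t @ q                     # one similarity score per token
    idx = np.argsort(-scores)[:k]      # indices of the k highest scores
    return tokens[idx], idx

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross attention with a residual.

    Assumption: no learned query/key/value projections, for brevity.
    """
    d = queries.shape[1]
    logits = queries @ keys_values.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return queries + weights @ keys_values

# Hypothetical sizes: 196 base tokens, three extra scales, 32-dim features.
rng = np.random.default_rng(0)
base_tokens = rng.normal(size=(196, 32))
multi_scale = {8: rng.normal(size=(784, 32)),
               16: rng.normal(size=(196, 32)),
               28: rng.normal(size=(64, 32))}
click_kernel = rng.normal(size=32)     # stand-in for the click-derived kernel

selected = np.concatenate(
    [select_topk_tokens(toks, click_kernel, k=16)[0]
     for toks in multi_scale.values()]
)
guided_tokens = cross_attention(base_tokens, selected)
```

Because only the top-$k$ tokens per scale are kept, the attention cost in this sketch scales with the number of selected tokens (48 here) rather than with the full multi-scale token set.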
Abstract
Interactive segmentation has gained significant attention for its applications in human-computer interaction and data annotation. To address target scale variation in interactive segmentation, we propose a novel multi-scale token adaptation algorithm. By performing top-k selection across multi-scale tokens, the algorithm greatly reduces computational cost while maintaining performance. To make multi-scale token selection more robust, we also propose a token learning algorithm based on a contrastive loss, which effectively improves the performance of multi-scale token adaptation. Extensive benchmarking shows that the algorithm achieves state-of-the-art (SOTA) performance compared with current methods. An interactive demo and all code needed to reproduce the results will be released at https://github.com/hahamyt/mst.
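To make the contrastive token learning idea concrete, the sketch below shows a generic triplet-style hinge loss in NumPy. The exact form of the loss is not given in this summary, so the choice of anchor (a click-derived kernel), the cosine similarity, the margin value, and all function names are assumptions for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def triplet_like_loss(anchor, positives, negatives, margin=0.2):
    """Hinge loss pushing target tokens closer to the anchor than background tokens.

    Assumptions: anchor is a single vector, positives are target tokens,
    negatives are background tokens, and margin=0.2 is an arbitrary choice.
    """
    pos = cosine_sim(anchor[None, :], positives)[0]   # similarity to each target token
    neg = cosine_sim(anchor[None, :], negatives)[0]   # similarity to each background token
    # One hinge term per (positive, negative) pair.
    gaps = margin + neg[None, :] - pos[:, None]
    return float(np.maximum(gaps, 0.0).mean())
```

The loss is zero once every target token is more similar to the anchor than every background token by at least the margin, which is the sense in which such a term sharpens token discrimination.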
