M2N2V2: Multi-Modal Unsupervised and Training-free Interactive Segmentation

Markus Karmann, Peng-Tao Jiang, Bo Li, Onay Urfalioglu

TL;DR

M2N2V2 tackles unsupervised, training-free interactive segmentation by fusing high-resolution depth guidance with attention-based Markov-maps. The method introduces depth-guided Markov-maps, depth-integrated JBU and flood fill, and an adaptive segment-size score function that stabilizes the segmentation across successive user prompts, all without any labeled data. Empirically, M2N2V2 substantially reduces the Number of Clicks (NoC) and improves mIoU over its predecessor M2N2 on all datasets except those from the medical domain, where depth is less informative. It is also competitive with supervised methods such as SAM and SimpleClick on the more challenging DAVIS and HQSeg44K datasets in the NoC metric. The approach offers a practical, drop-in framework with multi-modal cues, supported by a public code release.

Abstract

We present Markov Map Nearest Neighbor V2 (M2N2V2), a novel, simple, yet effective approach that leverages depth guidance and attention maps for unsupervised and training-free point-prompt-based interactive segmentation. Following recent trends in supervised multimodal approaches, we carefully integrate depth as an additional modality to create novel depth-guided Markov-maps. Furthermore, we observe occasional segment size fluctuations in M2N2 during the interactive process, which can decrease the overall mIoU. To mitigate this problem, we model the prompting as a sequential process and propose a novel adaptive score function that considers the previous segmentation and the current prompt point in order to prevent unreasonable segment size changes. Using Stable Diffusion 2 and Depth Anything V2 as backbones, we empirically show that our proposed M2N2V2 significantly improves the Number of Clicks (NoC) and mIoU compared to M2N2 on all datasets except those from the medical domain. Interestingly, our unsupervised approach achieves competitive results compared to supervised methods like SAM and SimpleClick on the more challenging DAVIS and HQSeg44K datasets in the NoC metric, reducing the gap between supervised and unsupervised methods.
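
The abstract reports results in terms of NoC and mIoU. As a rough illustration, the sketch below computes both from per-click IoU values, assuming the convention common in interactive segmentation benchmarks: NoC at a threshold is the number of clicks needed to first reach that IoU, capped at a maximum click budget. The function names, the default budget of 20 clicks, and the input layout are assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence


def noc(ious_per_click: Sequence[float],
        threshold: float = 0.95,
        max_clicks: int = 20) -> int:
    """Number of clicks needed to reach `threshold` IoU for one image.

    `ious_per_click[k-1]` is the IoU after the k-th click.
    Returns `max_clicks` if the threshold is never reached (assumed cap).
    """
    for k, iou in enumerate(ious_per_click[:max_clicks], start=1):
        if iou >= threshold:
            return k
    return max_clicks


def mean_iou_at(ious_per_image: Sequence[Sequence[float]], k: int) -> float:
    """Mean IoU over images after exactly `k` clicks (e.g. k=3 for mIoU@3)."""
    return sum(seq[k - 1] for seq in ious_per_image) / len(ious_per_image)


def mean_noc(ious_per_image: Sequence[Sequence[float]],
             threshold: float = 0.95,
             max_clicks: int = 20) -> float:
    """Average NoC over a dataset (e.g. threshold=0.95 for NoC95)."""
    values = [noc(seq, threshold, max_clicks) for seq in ious_per_image]
    return sum(values) / len(values)
```

For example, an image whose IoU after clicks 1 to 3 is 0.50, 0.90, 0.96 has a NoC95 of 3, while an image that never reaches 0.95 is counted with the full click budget.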

Paper Structure

This paper contains 23 sections, 10 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: M2N2V2 framework overview. Our contributions are highlighted in blue. The illustration shows an example with $N=3$ prompt points (2 green foreground points, 1 red background point). First, we extract the attention tensor $\boldsymbol{A}$ and depth map $D$ of a given image $I$. Then we use the depth, RGB and attention information to generate Markov-maps $M_i$ for each prompt point $x_i$. Finally, each Markov-map is scaled based on a scoring function utilizing the previous segmentation result $\hat{Y}_{N-1}$ to predict $\hat{Y}_N$.
  • Figure 2: Boundary distance estimation. Two examples with the gray area representing the previous segmentation and the dashed line depicting the area the user wants to include (left) or exclude (right). On the left side, the user places a new foreground prompt point and on the right a background prompt point. The circle around each prompt point is drawn with a radius $r$ of the shortest distance to the previous segmentation boundary (shown as an arrow).
  • Figure 3: M2N2V2 vs. M2N2. Comparison of M2N2V2 with M2N2 on all 10 datasets, showing the difference in NoC95 as well as mIoU@3.
  • Figure 4: mIoU per NoC. Using the adaptive segment size score function (blue curve), we achieve a higher and smoother mIoU curve.
  • Figure 5: Example predictions on the HQSeg44K dataset. Images are sorted from easy to more difficult, left to right. Foreground prompt points are shown in green and background prompt points in red. M2N2V2 is able to segment the majority of fine structures successfully.
  • ...and 9 more figures
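
The boundary distance in Figure 2 can be illustrated with a short sketch. It approximates the radius $r$ as the Euclidean distance from a prompt point to the nearest pixel whose label in the previous segmentation differs from the label at the point itself. This pixel-level approximation, the function name `boundary_radius`, and the nested-list mask layout are assumptions made for this example; the caption does not specify how the paper computes the distance or how $r$ enters the score function.

```python
import math
from typing import Sequence, Tuple


def boundary_radius(prev_mask: Sequence[Sequence[bool]],
                    point: Tuple[int, int]) -> float:
    """Approximate shortest distance from `point` to the boundary of `prev_mask`.

    `prev_mask[r][c]` is True where the previous segmentation is foreground.
    `point` is (row, col). The distance is measured to the nearest pixel with
    the opposite label, which works for points inside or outside the mask.
    Returns math.inf if the mask is uniform (no boundary exists).
    """
    r0, c0 = point
    inside = bool(prev_mask[r0][c0])
    best = math.inf
    # Brute-force scan; fine for a sketch, quadratic in image size per query.
    for r, row in enumerate(prev_mask):
        for c, value in enumerate(row):
            if bool(value) != inside:
                best = min(best, math.hypot(r - r0, c - c0))
    return best


# Toy 5x5 previous segmentation: the two leftmost columns are foreground.
mask = [[col < 2 for col in range(5)] for _ in range(5)]

r_outside = boundary_radius(mask, (2, 4))  # new point outside the old segment
r_inside = boundary_radius(mask, (2, 0))   # new point inside the old segment
```

For full-resolution images, a Euclidean distance transform of the mask and of its complement would give the same quantity for every pixel at once and is the more practical choice.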