Learning from Exemplars for Interactive Image Segmentation

Kun Li, Hao Cheng, George Vosselman, Michael Ying Yang

TL;DR

This work addresses interactive image segmentation of multiple same-category objects by learning from a satisfactorily segmented exemplar to guide segmentation of the remaining objects. It introduces iCMFormer for single-object interactive segmentation (SOIS) and its extension iCMFormer++ for multiple-object interactive segmentation (MOIS); the latter adds an exemplar-informed module and a lightweight channel fusion scheme within a two-stream transformer backbone. Results on SOIS and MOIS benchmarks, including the extended COCO MOIS dataset, show state-of-the-art performance and reduced user effort: around 15% less labor, or two fewer clicks, to reach target IoUs of 85% and 90%. The approach shows practical potential for scalable and efficient annotation, while leaving prompt-based enhancements and broader user studies to future work.

Abstract

Interactive image segmentation enables users to interact minimally with a machine, facilitating the gradual refinement of the segmentation mask for a target of interest. Previous studies have demonstrated impressive performance in extracting a single target mask through interactive segmentation. However, the information cues of previously interacted objects have been overlooked in the existing methods, which can be further explored to speed up interactive segmentation for multiple targets in the same category. To this end, we introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category. Specifically, our model leverages transformer backbones to extract interaction-focused visual features from the image and the interactions to obtain a satisfactory mask of a target as an exemplar. For multiple objects, we propose an exemplar-informed module to enhance the learning of similarities among the objects of the target category. To combine attended features from different modules, we incorporate cross-attention blocks followed by a feature fusion module. Experiments conducted on mainstream benchmarks demonstrate that our models achieve superior performance compared to previous methods. Particularly, our model reduces users' labor by around 15%, requiring two fewer clicks to achieve target IoUs of 85% and 90%. The results highlight our models' potential as a flexible and practical annotation tool. The source code will be released after publication.
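
The click-based claim in the abstract rests on two quantities: the IoU between a predicted mask and the ground truth, and the number of clicks needed to reach a target IoU. The following is a minimal sketch of how these could be computed, not the authors' evaluation code. The function names, the toy masks, the per-click IoU trace, and the 20-click cap (a common convention in this literature, assumed here) are all illustrative assumptions.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (assumption).
    return inter / union if union else 1.0


def clicks_to_reach(ious_per_click, threshold, max_clicks=20):
    """Smallest number of clicks whose IoU reaches `threshold`.

    ious_per_click[i] is the IoU after click i + 1. If the threshold is
    never reached within `max_clicks`, the cap itself is returned.
    """
    for n, value in enumerate(ious_per_click[:max_clicks], start=1):
        if value >= threshold:
            return n
    return max_clicks


# Toy masks (hypothetical, not taken from any benchmark).
target = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
pred = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
print(iou(pred, target))  # 3 shared pixels out of 4 in the union -> 0.75

# Hypothetical IoU after each successive click on one object.
trace = [0.60, 0.78, 0.86, 0.91]
print(clicks_to_reach(trace, 0.85))  # -> 3
print(clicks_to_reach(trace, 0.90))  # -> 4
```

Averaging `clicks_to_reach` over all objects in a dataset gives a per-threshold click count, which is the kind of figure a "two fewer clicks" comparison between methods would be based on.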

Paper Structure (29 sections, 5 equations, 10 figures, 10 tables, 1 algorithm)

This paper contains 29 sections, 5 equations, 10 figures, 10 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a) Current interactive segmentation methods (red arrows) designed for a single object require several rounds of interactions. In each round, users need to provide a set of positive or negative clicks to indicate the target. (b) Our interactive segmentation method (green arrows) predicts the mask of multiple objects in the same category by leveraging the previously interacted object.
  • Figure 2: Overall iCMFormer and iCMFormer++ frameworks. The prediction of an object by the iCMFormer serves as the exemplar for our iCMFormer++ model, which is denoted by a red dotted line. The iCMFormer++ model is built on top of the iCMFormer model with several modifications. Specifically, we first adopt a two-stream transformer backbone to extract the features from the previously obtained exemplar (denoted as exemplar branch with green arrows) and the overall image with additional interactions (denoted as recall branch with blue arrows). An exemplar-informed module is employed to learn the similarity between the exemplar and other potential image regions. Finally, a channel fusion module processes the cross-attended features before the final segmentation for all objects in the same category. For brevity, we remove the position embedding (PE) and the upsampling step in the illustration of iCMFormer++.
  • Figure 3: Architecture of the proposed exemplar-informed module (EIM). A pre-trained ResNet (He et al., 2016) is employed as the extractor for the visual representation. The Conv kernel obtains the point-wise feature comparison by convolving the projected features $\text{Feat}^p_o$ with kernels from $\text{Feat}^p_e$. With a normalization layer, the EIM outputs the response activation vector denoted as $R^n_f$.
  • Figure 4: Illustration of the proposed channel embedding fusion module. We feed the module with cross-attended features from the exemplar branch and the recall branch, represented as $\text{F}_e$ and $\text{F}_r$. The DWConv denotes a depth-wise convolution layer. The module outputs the fused feature $\text{F}_f$ before the final segmentation head.
  • Figure 5: Convergence analysis of the mean IoU (mIoU$\circledast k$) curves for a varying number of clicks. The evaluation results on GrabCut (Rother et al., 2004), Berkeley (McGuinness and O'Connor, 2010), SBD (Hariharan et al., 2011), COCO MVal, COCO MOIS and HIM2K (Sun et al., 2022) are provided. A higher starting point typically indicates better results with the first positive click. A steeper slope indicates that the method requires fewer clicks to achieve better segmentation results.
  • ...and 5 more figures
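
The Figure 3 caption describes the exemplar-informed module as convolving projected image features with kernels taken from the exemplar features, then normalizing the response. A rough sketch of that idea follows, under strong simplifying assumptions: the kernel is reduced to a single 1x1 channel vector, features are plain nested lists rather than ResNet outputs, and the normalization is min-max scaling, since the caption does not specify the kernel size or the normalization layer. The function name `exemplar_response` is hypothetical.

```python
def exemplar_response(feat_o, kernel_e):
    """Correlate every spatial position of feat_o with a 1x1 exemplar kernel.

    feat_o:   H x W x C nested lists, a stand-in for the projected image features.
    kernel_e: length-C list, a stand-in for a kernel taken from exemplar features.
    Returns an H x W response map, min-max scaled to [0, 1] (assumed normalization).
    """
    # Point-wise comparison: dot product between each feature cell and the kernel.
    raw = [
        [sum(f * k for f, k in zip(cell, kernel_e)) for cell in row]
        for row in feat_o
    ]
    flat = [value for row in raw for value in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant response carries no contrast; return all zeros.
        return [[0.0 for _ in row] for row in raw]
    return [[(value - lo) / span for value in row] for row in raw]


# Toy 2 x 2 feature map with 2 channels (hypothetical values).
feat_o = [
    [[1, 0], [0, 1]],
    [[1, 1], [0, 0]],
]
kernel_e = [1, 0]  # the exemplar "looks like" channel 0
print(exemplar_response(feat_o, kernel_e))  # -> [[1.0, 0.0], [1.0, 0.0]]
```

Positions whose features resemble the exemplar kernel receive responses near 1, which is how such a map could highlight other objects of the same category in the recall branch.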