Learning from Exemplars for Interactive Image Segmentation
Kun Li, Hao Cheng, George Vosselman, Michael Ying Yang
TL;DR
This work addresses interactive image segmentation of multiple objects in the same category by learning from one satisfactorily segmented exemplar to guide the segmentation of the remaining objects. It introduces iCMFormer and its multiple-object (MOIS) extension iCMFormer++, which add an exemplar-informed module and a lightweight channel fusion scheme to a two-stream transformer backbone. Results on single-object (SOIS) and MOIS benchmarks, including the extended COCO MOIS dataset, show state-of-the-art performance and reduced user effort: a labor reduction of around 15%, or two fewer clicks, to reach target IoUs of 85% and 90%. The approach shows practical potential for scalable and efficient annotation, while leaving prompt-based enhancements and broader user studies to future work.
Abstract
Interactive image segmentation lets a user gradually refine the segmentation mask of a target of interest with minimal interaction. Previous studies have demonstrated impressive performance in extracting a single target mask through interactive segmentation. However, existing methods overlook the information cues of previously interacted objects, which can be exploited to speed up interactive segmentation of multiple targets in the same category. To this end, we introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category. Specifically, our model leverages transformer backbones to extract interaction-focused visual features from the image and the interactions, and uses them to obtain a satisfactory mask of one target as an exemplar. For multiple objects, we propose an exemplar-informed module to enhance the learning of similarities among the objects of the target category. To combine attended features from different modules, we incorporate cross-attention blocks followed by a feature fusion module. Experiments conducted on mainstream benchmarks demonstrate that our models achieve superior performance compared to previous methods. In particular, our model reduces users' labor by around 15%, requiring two fewer clicks to achieve target IoUs of 85% and 90%. The results highlight our models' potential as a flexible and practical annotation tool. The source code will be released after publication.
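The click counts quoted above are measured against a target intersection-over-union (IoU). As a rough illustration of how such a measurement is commonly set up, the following Python sketch computes the IoU of two binary masks and counts how many clicks an interaction trace needs to reach a target. The function names `iou` and `clicks_to_reach`, the cap of 20 clicks, and the example IoU trace are assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np


def iou(pred, gt):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; treat them as a perfect match (a convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def clicks_to_reach(ious_per_click, target, max_clicks=20):
    """Number of clicks until the IoU first reaches `target`.

    `ious_per_click[k - 1]` is the IoU after the k-th click. If the target
    is never reached, the budget `max_clicks` is returned (assumed cap).
    """
    for k, value in enumerate(ious_per_click[:max_clicks], start=1):
        if value >= target:
            return k
    return max_clicks


# Toy masks: the ground truth is a 2x2 square, the prediction a 2x3 block.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True
overlap = iou(pred, gt)  # 4 shared pixels over 6 in the union

# Made-up IoU values after each of four clicks, for illustration only.
trace = [0.62, 0.81, 0.86, 0.91]
clicks_85 = clicks_to_reach(trace, 0.85)
clicks_90 = clicks_to_reach(trace, 0.90)
```

With this made-up trace, the 85% target is reached at the third click and the 90% target at the fourth. Averaging such counts over a dataset gives a number-of-clicks figure that can be compared across methods.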
