SimpleClick: Interactive Image Segmentation with Simple Vision Transformers

Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer

TL;DR

This work tackles interactive image segmentation with plain Vision Transformer (ViT) backbones instead of the usual hierarchical networks. It introduces SimpleClick, which fuses user clicks into the backbone through a symmetric patch-embedding layer and pairs a single-scale ViT backbone, pretrained as a masked autoencoder (MAE), with a lightweight multi-scale feature pyramid and an all-MLP decoder. The approach achieves state-of-the-art performance on natural-image benchmarks (notably 4.15 NoC@90 on SBD) and generalizes to medical images without finetuning; a tiny ViT variant, backed by a computational analysis, shows the method is efficient enough to serve as a practical annotation tool. Overall, SimpleClick establishes plain ViT backbones as strong baselines for interactive segmentation.

Abstract

Click-based interactive image segmentation aims at extracting objects with limited user clicks. A hierarchical backbone is the de facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to be a foundation model that can be finetuned for downstream tasks without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive image segmentation. To fill this gap, we propose SimpleClick, the first interactive segmentation method that leverages a plain backbone. Based on the plain backbone, we introduce a symmetric patch embedding layer that encodes clicks into the backbone with minor modifications to the backbone itself. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance. Remarkably, our method achieves 4.15 NoC@90 on SBD, improving 21.8% over the previous best result. Extensive evaluation on medical images demonstrates the generalizability of our method. We further develop an extremely tiny ViT backbone for SimpleClick and provide a detailed computational analysis, highlighting its suitability as a practical annotation tool.
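
NoC@90, the metric quoted in the abstract, counts how many clicks are needed before the predicted mask reaches 90% IoU with the ground truth. The following is a minimal sketch of how such a score could be averaged over a dataset. The function names, the cap of 20 clicks for samples that never reach the target, and the toy IoU values are all assumptions made for illustration; they are not taken from the paper or its code.

```python
from typing import Sequence


def clicks_to_reach(ious: Sequence[float], target: float = 0.90,
                    max_clicks: int = 20) -> int:
    """Return the 1-based index of the first click whose IoU meets the target.

    If the target is never reached within max_clicks, return max_clicks.
    The cap (and its default of 20) is an assumed convention.
    """
    for click_index, iou in enumerate(ious[:max_clicks], start=1):
        if iou >= target:
            return click_index
    return max_clicks


def mean_noc(per_sample_ious: Sequence[Sequence[float]], target: float = 0.90,
             max_clicks: int = 20) -> float:
    """Average the per-sample click counts over all samples."""
    counts = [clicks_to_reach(ious, target, max_clicks)
              for ious in per_sample_ious]
    return sum(counts) / len(counts)


# Made-up IoU trajectories, one list per image (not data from the paper).
toy_ious = [
    [0.72, 0.88, 0.93],        # reaches 90% at click 3
    [0.95],                    # reaches 90% at click 1
    [0.40, 0.55, 0.61, 0.70],  # never reaches 90%, counted as max_clicks
]
print(mean_noc(toy_ious, target=0.90, max_clicks=20))  # prints 8.0
```

Lower is better: a method that needs fewer clicks per object yields a smaller mean. How failed samples are counted affects the average noticeably, so the cap should be stated whenever scores are compared.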

Paper Structure (19 sections, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Interactive segmentation results on SBD [hariharan2011semantic]. The metric "NoC@90" denotes the number of clicks required to obtain 90% IoU. The area of each bubble is proportional to the FLOPs of a model variant (see the computation analysis table). We show that plain ViTs outperform all hierarchical backbones for interactive image segmentation at a moderate computational cost.
  • Figure 2: SimpleClick overview. Our method consists of three main modules: (a) a plain ViT backbone that maintains single-scale feature maps throughout; (b) a multi-scale simple feature pyramid that is generated from the last feature map of the backbone by four parallel convolution or deconvolution layers; (c) a light-weight MLP decoder for segmentation. We also input the previous segmentation to improve the performance and to allow clicking from existing masks. User clicks are encoded as a two-channel disk map, combined with the previous segmentation as input. Boxes in blue are intermediate feature maps. The positional encoding is not shown for brevity.
  • Figure 3: Convergence analysis for models trained on either SBD [hariharan2011semantic] or COCO [lin2014microsoft] + LVIS [gupta2019lvis] (C+L). We report results on SBD [hariharan2011semantic] and Pascal VOC [everingham2010pascal]. The metric is mean IoU given $k$ clicks (mIoU@$k$). Our models in general require fewer clicks for a given accuracy level.
  • Figure 4: Histogram analysis of IoU given a predefined number of clicks $k$ (IoU@$k$). We report analysis on SBD [hariharan2011semantic] and Pascal VOC [everingham2010pascal] with models trained on SBD. Compared with the two baselines, our models achieve higher-quality segmentation with fewer failure cases.
  • Figure 5: Convergence analysis for two medical image datasets: BraTS [baid2021rsna] and OAIZIB [ambellan2019automated]. Models are trained on either SBD [hariharan2011semantic] or COCO [lin2014microsoft] + LVIS [gupta2019lvis] (denoted as C+L). The metric is mean IoU given $k$ clicks. Overall, our models require fewer clicks for a given accuracy level. The performance gain is more prominent for bigger models (e.g. ViT-H) or larger training sets (e.g. C+L).
  • ...and 4 more figures
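
The Figure 2 caption describes user clicks as a two-channel disk map that is combined with the previous segmentation to form the network input. A minimal sketch of that encoding follows, in plain Python with nested lists so that it runs without dependencies. The `(row, col, is_positive)` click layout, the channel order (positive first, negative second), the disk radius, and both function names are assumptions for illustration, not details confirmed by the source.

```python
from typing import List, Tuple

# Hypothetical click layout: (row, col, is_positive).
Click = Tuple[int, int, bool]
Plane = List[List[float]]


def encode_clicks(height: int, width: int, clicks: List[Click],
                  radius: int = 5) -> List[Plane]:
    """Return two height x width planes with a filled disk around each click.

    Channel 0 holds positive clicks and channel 1 negative clicks
    (the ordering is an assumption).
    """
    maps = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for row, col, is_positive in clicks:
        channel = 0 if is_positive else 1
        # Only visit the bounding box of the disk, clipped to the image.
        for r in range(max(0, row - radius), min(height, row + radius + 1)):
            for c in range(max(0, col - radius), min(width, col + radius + 1)):
                if (r - row) ** 2 + (c - col) ** 2 <= radius ** 2:
                    maps[channel][r][c] = 1.0
    return maps


def stack_with_previous_mask(click_maps: List[Plane],
                             prev_mask: Plane) -> List[Plane]:
    """Append the previous segmentation as a third channel."""
    return click_maps + [prev_mask]


# Toy example: one positive and one negative click on an 8x8 image.
click_maps = encode_clicks(8, 8, [(2, 2, True), (6, 6, False)], radius=1)
empty_mask = [[0.0] * 8 for _ in range(8)]  # no previous segmentation yet
network_input = stack_with_previous_mask(click_maps, empty_mask)
print(len(network_input), len(network_input[0]), len(network_input[0][0]))
```

On the first click there is no earlier prediction, so the sketch passes an all-zero mask; on later clicks the previous prediction would take its place. Starting from an existing mask, as the caption mentions, amounts to supplying that mask as the third channel.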