Reviving Iterative Training with Mask Guidance for Interactive Segmentation

Konstantin Sofiiuk, Ilia A. Petrov, Anton Konushin

TL;DR

The paper tackles click-based interactive segmentation by replacing inference-time optimization with a feedforward model that takes the mask from the previous step as an additional input. It demonstrates that an HRNet+OCR backbone with disk-based click encoding and normalized focal loss achieves state-of-the-art results, especially when trained on the combined COCO+LVIS dataset. The authors show that feeding in the previous-step mask makes predictions more stable as clicks accumulate, and that the same model can either segment an entirely new object or start from an external mask and correct it. Because the model needs only forward passes, it is easier to deploy on mobile frameworks than optimization-based methods. The code and trained models are publicly released.

Abstract

Recent works on click-based interactive segmentation have demonstrated state-of-the-art results by using various inference-time optimization schemes. These methods are considerably more computationally expensive than feedforward approaches, as they require performing backward passes through a network during inference and are hard to deploy on mobile frameworks that usually support only forward passes. In this paper, we extensively evaluate various design choices for interactive segmentation and discover that new state-of-the-art results can be obtained without any additional optimization schemes. Thus, we propose a simple feedforward model for click-based interactive segmentation that employs the segmentation masks from previous steps. It can not only segment an entirely new object, but also start with an external mask and correct it. When analyzing the performance of models trained on different datasets, we observe that the choice of a training dataset greatly impacts the quality of interactive segmentation. We find that the models trained on a combination of COCO and LVIS with diverse and high-quality annotations show performance superior to all existing models. The code and trained models are available at https://github.com/saic-vul/ritm_interactive_segmentation.
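
To make the idea of feeding clicks and a previous-step mask to a feedforward network more concrete, here is a minimal NumPy sketch of how such an input could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function names `encode_clicks_as_disks` and `build_network_input`, the default radius of 5 pixels, and the choice to concatenate everything into a single 6-channel array are all hypothetical.

```python
import numpy as np


def encode_clicks_as_disks(clicks, height, width, radius=5):
    """Return a binary (height, width) map with a filled disk at each click.

    `clicks` is an iterable of (row, col) pixel coordinates.
    The radius is an illustrative default, not a value taken from the text.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    disk_map = np.zeros((height, width), dtype=np.float32)
    for r, c in clicks:
        inside = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
        disk_map[inside] = 1.0
    return disk_map


def build_network_input(image, pos_clicks, neg_clicks, prev_mask=None, radius=5):
    """Stack RGB image, positive clicks, negative clicks and previous mask.

    When no previous mask exists (first click on a new object), an all-zero
    mask is used. An external mask can be passed as `prev_mask` to start
    from an existing segmentation and correct it.
    Output shape: (height, width, 6).
    """
    height, width = image.shape[:2]
    if prev_mask is None:
        prev_mask = np.zeros((height, width), dtype=np.float32)
    pos = encode_clicks_as_disks(pos_clicks, height, width, radius)
    neg = encode_clicks_as_disks(neg_clicks, height, width, radius)
    extra = np.stack([pos, neg, prev_mask.astype(np.float32)], axis=-1)
    return np.concatenate([image.astype(np.float32), extra], axis=-1)


# Usage: one positive click on a blank 64x64 image, no previous mask.
example_input = build_network_input(
    np.zeros((64, 64, 3)), pos_clicks=[(32, 32)], neg_clicks=[]
)
```

In an interactive loop, the mask predicted at one step would be passed as `prev_mask` at the next step together with the updated click lists, so each step remains a single forward pass.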

Paper Structure

This paper contains 15 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Besides segmenting new objects, the proposed method allows correcting external masks, e.g. those produced by other instance or semantic segmentation models. A user can fix false negative and false positive regions with positive (green) and negative (red) clicks, respectively.
  • Figure 2: Visualization of two different approaches for encoding user clicks.
  • Figure 3: Different architecture choices for feeding encoded clicks to a backbone, described in Section \ref{sec:revising_network_and_brs}.
  • Figure 4: Mean IoU@$k$ for a varying number of clicks $k$ on GrabCut, Berkeley, DAVIS and SBD. The iterative model that takes a mask from a previous step is much more stable and converges to a better IoU. All results are reported for the model with the HRNet-18+OCR backbone trained on COCO+LVIS; iterative models are trained with $N_{iters}=3$.
  • Figure 5: Visualization of interactive segmentation on Berkeley images for different numbers of clicks fed to the HRNet-18 ITER-M model, with the clicks obtained by the NoC evaluation procedure [xu2016deep]. Green and red dots denote positive and negative clicks, respectively. There are only 2 images from Berkeley on which our model does not converge to 90% IoU in 20 clicks. One of them is shown in the third row.
  • ...and 1 more figure
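
The captions for Figures 4 and 5 refer to two evaluation quantities: mean IoU after $k$ clicks and the number of clicks (NoC) needed to reach a target IoU. The sketch below shows one plausible way to compute them from per-image IoU curves. The function names are hypothetical, and the defaults (a 0.9 threshold, a 20-click budget, and counting non-converged images as using the full budget) are assumptions based on the "90% IoU in 20 clicks" wording and the common NoC convention, not details confirmed by this page.

```python
import numpy as np


def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask).astype(bool)
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treat as a perfect match (a convention choice).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def number_of_clicks(ious_per_click, threshold=0.9, max_clicks=20):
    """Return the first click count k whose IoU reaches `threshold`.

    `ious_per_click[k - 1]` is the IoU after k clicks. If the threshold is
    never reached within `max_clicks`, return `max_clicks` (assumed
    convention for non-converged images).
    """
    for k, value in enumerate(ious_per_click[:max_clicks], start=1):
        if value >= threshold:
            return k
    return max_clicks


def mean_iou_at_k(iou_curves, k):
    """Average the IoU after exactly k clicks over a set of images."""
    return float(np.mean([curve[k - 1] for curve in iou_curves]))


# Usage: two hypothetical per-image IoU curves.
curves = [[0.50, 0.80, 0.92, 0.95], [0.70, 0.91, 0.93, 0.96]]
noc_values = [number_of_clicks(c) for c in curves]
```

Plotting `mean_iou_at_k` against `k` would give a curve of the kind described for Figure 4, while averaging `noc_values` over a dataset would give a NoC@90 style score.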