Iteratively Trained Interactive Segmentation

Sabarinath Mahadevan, Paul Voigtlaender, Bastian Leibe

TL;DR

This work presents Iteratively Trained Interactive Segmentation (ITIS), a click-based interactive segmentation framework that uses an iterative training loop to add correction clicks reflecting actual user behavior. By encoding clicks as Gaussians, optionally incorporating a distance-transform mask, and training with an iterative correction strategy, ITIS achieves state-of-the-art performance with fewer interactions on standard benchmarks. The approach also demonstrates robustness to test-time click strategies and proves effective in video mask correction and KITTI instance annotation, indicating practical impact for large-scale annotation workflows.

Abstract

Deep learning requires large amounts of training data to be effective. For object segmentation, manually labeling data is very expensive, so interactive methods are needed. Following recent approaches, we develop an interactive object segmentation system that takes user clicks as input to a convolutional network. While previous methods use heuristic click sampling strategies to emulate user clicks during training, we propose a new iterative training strategy: during training, we iteratively add clicks based on the errors of the currently predicted segmentation. We show that this iterative training strategy, together with additional improvements to the network architecture, improves on the state of the art.
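The idea of adding clicks based on the errors of the current prediction can be illustrated with a small sketch. This is not the authors' implementation: the function name `next_correction_click` is hypothetical, and the placement rule (click the error pixel nearest the centroid of the larger error type) is a simplifying assumption, since the exact rule is not given here.

```python
import numpy as np

def next_correction_click(pred, gt):
    """Return (row, col, is_positive) for one simulated correction click.

    Returns None when the prediction already matches the ground truth.
    Assumption: the click goes on the larger of the two error types
    (missed object pixels vs. spurious object pixels).
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    false_neg = gt & ~pred   # object pixels the prediction missed
    false_pos = pred & ~gt   # background pixels predicted as object
    n_fn, n_fp = int(false_neg.sum()), int(false_pos.sum())
    if n_fn == 0 and n_fp == 0:
        return None
    is_positive = n_fn >= n_fp               # positive click fixes missed object
    errors = false_neg if is_positive else false_pos
    coords = np.argwhere(errors)             # (K, 2) array of (row, col)
    centroid = coords.mean(axis=0)
    # Snap to an actual error pixel so the click never lands outside the error set.
    nearest = coords[np.argmin(((coords - centroid) ** 2).sum(axis=1))]
    return int(nearest[0]), int(nearest[1]), is_positive

# Toy example: the prediction covers only the left half of a square object.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros_like(gt)
pred[2:6, 2:4] = True
click = next_correction_click(pred, gt)
```

In a training loop, such a click would be appended to the click set, the click channels re-rendered, and the network run again, so that later training inputs reflect the model's own mistakes rather than a fixed sampling heuristic.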

Paper Structure

This paper contains 17 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of our method. The input to our network consists of an RGB image concatenated with two click channels representing negative and positive clicks, and also an optional mask channel encoded as distance transform.
  • Figure 3: Mean IoU score against the number of clicks used to achieve it on the PASCAL VOC [Everingham10IJCV] and GrabCut [Rother04SIGGRAPH] datasets.
  • Figure 4: Effect of different click sampling strategies at test time. It can be seen that our method generalizes to alternative sampling methods with only a small loss in performance.
  • Figure 5: Ablation study on PASCAL VOC. It can be seen, both from the table on the left and the plot on the right, that the proposed iterative training procedure significantly improves the results.
  • Figure 6: Interactive segmentation performance for segmenting 741 cars on KITTI. Over a wide range of click counts, our method performs better than Polygon-RNN, even though Polygon-RNN uses the ground truth bounding box and requires more manual effort per click.
  • ...and 1 more figure
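The input layout described for Figure 1 (RGB image, two click channels, optional distance-transform mask channel) can be sketched as follows, with clicks rendered as Gaussians as stated in the TL;DR. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the Gaussian width `sigma`, the clipping value, the scaling to [0, 1], and the channel order are all hypothetical choices.

```python
import numpy as np

def gaussian_click_map(shape, clicks, sigma=10.0):
    """Render (row, col) clicks as the pixelwise max of isotropic Gaussians."""
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    out = np.zeros((h, w), dtype=np.float32)
    for r, c in clicks:
        g = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
        out = np.maximum(out, g.astype(np.float32))
    return out

def mask_distance_map(mask, clip=255.0):
    """Euclidean distance from each pixel to the nearest mask pixel.

    Brute force, O(H*W*K) memory: fine for a toy example only.
    """
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    fg = np.argwhere(mask)                         # (K, 2) mask pixel coordinates
    if len(fg) == 0:
        return np.full((h, w), clip, dtype=np.float32)
    grid = np.stack(np.mgrid[0:h, 0:w], axis=-1).reshape(-1, 1, 2)
    d = np.sqrt(((grid - fg[None, :, :]) ** 2).sum(axis=-1)).min(axis=1)
    return np.minimum(d, clip).reshape(h, w).astype(np.float32)

def build_input(rgb, pos_clicks, neg_clicks, mask=None, sigma=10.0):
    """Stack RGB, negative clicks, positive clicks and an optional mask channel."""
    h, w = rgb.shape[:2]
    channels = [
        rgb.astype(np.float32) / 255.0,                          # assumed scaling
        gaussian_click_map((h, w), neg_clicks, sigma)[..., None],
        gaussian_click_map((h, w), pos_clicks, sigma)[..., None],
    ]
    if mask is not None:
        channels.append(mask_distance_map(mask)[..., None] / 255.0)
    return np.concatenate(channels, axis=-1)

# Toy usage: one positive click, one negative click, and a rough initial mask.
rgb = np.zeros((16, 16, 3), dtype=np.uint8)
mask = np.zeros((16, 16), dtype=bool)
mask[2:7, 2:7] = True
x = build_input(rgb, pos_clicks=[(4, 4)], neg_clicks=[(12, 12)], mask=mask)
```

For real image sizes, a library routine such as `scipy.ndimage.distance_transform_edt` applied to the inverted mask would be a more practical way to obtain the same distance map.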