Recurrent Neural Networks for Semantic Instance Segmentation

Amaia Salvador, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, Xavier Giro-i-Nieto

TL;DR

This work introduces RSIS, an end-to-end recurrent network that directly maps image pixels to a variable-length sequence of semantic instance masks and class labels. It uses a ResNet-101 encoder and a hierarchical ConvLSTM decoder with skip connections to generate one object per time step, along with bounding boxes, class probabilities, and a stop signal. Training employs a multi-task loss with Hungarian matching to align predictions with ground truth, plus curriculum learning for scenes with many objects. Experiments on Pascal VOC 2012, CVPPP, and Cityscapes show competitive performance against prior sequential methods and reveal interpretable object-discovery patterns linked to encoder activations. This approach eliminates post-processing, enabling end-to-end optimization for semantic instance segmentation.
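
The matching step mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the soft IoU cost, the function names `soft_iou` and `match_predictions`, and the flat-list mask representation are all assumptions made for illustration. To stay dependency-free, the sketch searches all assignments by brute force rather than running the Hungarian algorithm; for a handful of objects both return the same minimum-cost assignment, and the Hungarian algorithm simply reaches it in polynomial time.

```python
from itertools import permutations


def soft_iou(pred, target):
    """Soft IoU between a predicted mask (probabilities in [0, 1])
    and a binary ground-truth mask, both given as flat lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return inter / union if union > 0 else 0.0


def match_predictions(preds, targets):
    """Find the assignment of predictions to ground-truth objects that
    minimizes the total (1 - soft IoU) cost.

    Returns (assignment, cost), where assignment[i] is the index of the
    prediction matched to targets[i]. Assumes len(preds) >= len(targets).
    Brute force stands in here for the Hungarian algorithm.
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(1.0 - soft_iou(preds[j], targets[i])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


# Hypothetical toy example: two 4-pixel ground-truth masks and two
# predictions emitted in the opposite order.
targets = [[1, 1, 0, 0], [0, 0, 1, 1]]
preds = [[0.1, 0.0, 0.9, 0.8], [0.9, 0.8, 0.1, 0.0]]
assignment, cost = match_predictions(preds, targets)
print(assignment)  # [1, 0]
```

Because the loss is computed after matching, the order in which the network emits objects is not penalized, which is what leaves the model free to learn its own object-discovery order.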

Abstract

We present a recurrent model for semantic instance segmentation that sequentially generates binary masks and their associated class probabilities for every object in an image. Our proposed system is trainable end-to-end from an input image to a sequence of labeled masks and, compared to methods relying on object proposals, does not require post-processing steps on its output. We study the suitability of our recurrent model on three different instance segmentation benchmarks, namely Pascal VOC 2012, CVPPP Plant Leaf Segmentation and Cityscapes. Further, we analyze the object sorting patterns generated by our model and observe that it learns to follow a consistent pattern, which correlates with the activations learned in the encoder part of our network. Source code and models are available at https://imatge-upc.github.io/rsis/

Paper Structure

This paper contains 13 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Our proposed recurrent architecture for semantic instance segmentation.
  • Figure 2: Examples of generated output sequences for the three datasets.
  • Figure 3: (a) False positive distribution. (b-d) Error analysis on Pascal VOC (blue) and Cityscapes (green): (b) IoU vs time step, (c) False negative size distribution, (d) IoU vs object size (object size given as the image % it covers). Reported values in (a) and (d) are constrained to the particularities of each dataset (object sequences for Pascal VOC are shorter and objects in Cityscapes are smaller).
  • Figure 4: Examples of predicted object sequences for images in Pascal VOC 2012 validation set that highly correlate with the different sorting strategies.
  • Figure 5: Percentage of consecutive object pairs of different categories that follow a particular sorting pattern.
  • ...and 1 more figure