Recurrent Neural Networks for Semantic Instance Segmentation
Amaia Salvador, Miriam Bellver, Victor Campos, Manel Baradad, Ferran Marques, Jordi Torres, Xavier Giro-i-Nieto
TL;DR
This work introduces RSIS, an end-to-end recurrent network that directly maps image pixels to a variable-length sequence of semantic instance masks and class labels. It uses a ResNet-101 encoder and a hierarchical ConvLSTM decoder with skip connections to generate one object per time step, along with bounding boxes, class probabilities, and a stop signal. Training employs a multi-task loss with Hungarian matching to align predictions with ground truth, plus curriculum learning for scenes with many objects. Experiments on Pascal VOC 2012, CVPPP, and Cityscapes show competitive performance against prior sequential methods and reveal interpretable object-discovery patterns linked to encoder activations. This approach eliminates post-processing, enabling end-to-end optimization for semantic instance segmentation.
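The matching step mentioned above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the soft IoU cost, the function names soft_iou and match_predictions, and the flat-list mask representation are all assumptions made for illustration, since the summary does not state which cost is used. For brevity the optimal one-to-one assignment is found by exhaustive search over permutations, which yields the same optimum as the Hungarian algorithm but is only feasible for a handful of objects.

```python
from itertools import permutations


def soft_iou(pred, target):
    """Soft IoU between a predicted probability mask and a binary mask.

    Both masks are flat lists of equal length (an assumption for this sketch).
    """
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return inter / union if union > 0 else 0.0


def match_predictions(pred_masks, gt_masks):
    """Find the minimum-cost one-to-one assignment of predictions to ground truth.

    Returns (assignment, total_cost), where assignment[j] is the index of the
    prediction matched to ground-truth object j. The cost is 1 - soft IoU.
    Exhaustive search stands in for the Hungarian algorithm here.
    """
    n_pred, n_gt = len(pred_masks), len(gt_masks)
    if n_pred < n_gt:
        raise ValueError("need at least as many predictions as ground-truth objects")

    cost = [[1.0 - soft_iou(p, g) for g in gt_masks] for p in pred_masks]

    best, best_cost = None, float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best), best_cost


# Toy example: two predictions emitted in the "wrong" order relative to the
# ground-truth list; the matching recovers the correct pairing.
pred_masks = [[0.1, 0.9, 0.8, 0.0], [0.9, 0.1, 0.0, 0.0]]
gt_masks = [[1, 0, 0, 0], [0, 1, 1, 0]]
assignment, total_cost = match_predictions(pred_masks, gt_masks)
print(assignment, round(total_cost, 3))
```

Because the assignment is recomputed for every image, the network is free to emit objects in any order, and the loss is evaluated only on the matched pairs. For scenes with many objects a polynomial-time solver such as scipy.optimize.linear_sum_assignment would be the practical choice.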
Abstract
We present a recurrent model for semantic instance segmentation that sequentially generates binary masks and their associated class probabilities for every object in an image. Our proposed system is trainable end-to-end from an input image to a sequence of labeled masks and, unlike methods that rely on object proposals, requires no post-processing of its output. We evaluate our recurrent model on three instance segmentation benchmarks: Pascal VOC 2012, CVPPP Plant Leaf Segmentation, and Cityscapes. Further, we analyze the order in which our model predicts objects and observe that it learns to follow a consistent pattern, which correlates with the activations learned in the encoder part of our network. Source code and models are available at https://imatge-upc.github.io/rsis/
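As a rough illustration of how a variable-length sequence of labeled masks can be produced one object per time step, the following sketch runs a generic decoding loop until a stop signal fires. Everything here is hypothetical: decode_instances, step_fn, the 0.5 stop threshold, and the max_steps cap are assumptions for illustration, and whether the model checks the stop signal before or after emitting an object is not specified in this text.

```python
def decode_instances(step_fn, state, max_steps=20, stop_threshold=0.5):
    """Collect (mask, class_probs) pairs until the stop signal exceeds a threshold.

    step_fn is a hypothetical callable standing in for one decoder step. It
    takes the recurrent state and returns (mask, class_probs, stop_prob,
    new_state).
    """
    instances = []
    for _ in range(max_steps):
        mask, class_probs, stop_prob, state = step_fn(state)
        if stop_prob > stop_threshold:
            # Assumed convention: a high stop probability ends the sequence
            # and the output of that step is discarded.
            break
        instances.append((mask, class_probs))
    return instances


def toy_step(t):
    """Scripted stand-in for a decoder step: three objects, then stop."""
    stop_prob = 0.1 if t < 3 else 0.9
    mask = [t]                       # placeholder for a binary mask
    class_probs = {"object": 1.0}    # placeholder class distribution
    return mask, class_probs, stop_prob, t + 1


instances = decode_instances(toy_step, state=0)
print(len(instances))
```

The max_steps cap guards against a stop signal that never fires, and the output needs no merging or suppression afterwards because each step contributes at most one instance.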
