Table of Contents
Fetching ...

Semantic Segmentation of Video Sequences with Convolutional LSTMs

Andreas Pfeuffer, Karina Schulz, Klaus Dietmayer

TL;DR

This work extends the encoder-decoder network architecture of well-known segmentation approaches and place convolutional LSTM layers between encoder and decoder at different positions and evaluated on two different datasets in order to find the best one.

Abstract

Most of the semantic segmentation approaches have been developed for single image segmentation, and hence, video sequences are currently segmented by processing each frame of the video sequence separately. The disadvantage of this is that temporal image information is not considered, which improves the performance of the segmentation approach. One possibility to include temporal information is to use recurrent neural networks. However, there are only a few approaches using recurrent networks for video segmentation so far. These approaches extend the encoder-decoder network architecture of well-known segmentation approaches and place convolutional LSTM layers between encoder and decoder. However, in this paper it is shown that this position is not optimal, and that other positions in the network exhibit better performance. Nowadays, state-of-the-art segmentation approaches rarely use the classical encoder-decoder structure, but use multi-branch architectures. These architectures are more complex, and hence, it is more difficult to place the recurrent units at a proper position. In this work, the multi-branch architectures are extended by convolutional LSTM layers at different positions and evaluated on two different datasets in order to find the best one. It turned out that the proposed approach outperforms the pure CNN-based approach for up to 1.6 percent.

Semantic Segmentation of Video Sequences with Convolutional LSTMs

TL;DR

This work extends the encoder-decoder network architecture of well-known segmentation approaches and place convolutional LSTM layers between encoder and decoder at different positions and evaluated on two different datasets in order to find the best one.

Abstract

Most of the semantic segmentation approaches have been developed for single image segmentation, and hence, video sequences are currently segmented by processing each frame of the video sequence separately. The disadvantage of this is that temporal image information is not considered, which improves the performance of the segmentation approach. One possibility to include temporal information is to use recurrent neural networks. However, there are only a few approaches using recurrent networks for video segmentation so far. These approaches extend the encoder-decoder network architecture of well-known segmentation approaches and place convolutional LSTM layers between encoder and decoder. However, in this paper it is shown that this position is not optimal, and that other positions in the network exhibit better performance. Nowadays, state-of-the-art segmentation approaches rarely use the classical encoder-decoder structure, but use multi-branch architectures. These architectures are more complex, and hence, it is more difficult to place the recurrent units at a proper position. In this work, the multi-branch architectures are extended by convolutional LSTM layers at different positions and evaluated on two different datasets in order to find the best one. It turned out that the proposed approach outperforms the pure CNN-based approach for up to 1.6 percent.

Paper Structure

This paper contains 9 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Segmentation map of a video sequence yielded by the ICNet. The black boxes show typical errors such as partly segmented objects, flickering edges, and flickering (ghost) objects
  • Figure 2: LSTM-SegNet network architecture: the gray boxes illustrate the original SegNet architecture, while the different positions of the ConvLSTM layers are illustrated in color.
  • Figure 3: LSTM-ICNet network architecture: the gray boxes illustrate the original ICNet architecture, while the different positions of the ConvLSTM layers are illustrated in color.
  • Figure 4: Qualitative results of the proposed approaches over a video clip of the Cityscapes dataset. The first row contains the RGB images, the next four rows the segmentation map of the SegNet-based approaches and the last seven rows the results of the ICNet-based approaches.