Learning Temporal Distribution and Spatial Correlation Towards Universal Moving Object Segmentation

Guanfang Dong, Chenqiu Zhao, Xichen Pan, Anup Basu

TL;DR

This paper proposes Learning Temporal Distribution and Spatial Correlation (LTS), a method with the potential to be a general solution for universal moving object segmentation in real-world environments.

Abstract

The goal of moving object segmentation is to separate moving objects from stationary backgrounds in videos. One major challenge is developing a universal model for videos from various natural scenes, since previous methods are often effective only in specific scenes. In this paper, we propose a method called Learning Temporal Distribution and Spatial Correlation (LTS) that has the potential to be a general solution for universal moving object segmentation. In the proposed approach, the distribution of temporal pixels is first learned by our Defect Iterative Distribution Learning (DIDL) network for scene-independent segmentation. Notably, the DIDL network incorporates an improved product distribution layer that we have newly derived. Then, the Stochastic Bayesian Refinement (SBR) network, which learns the spatial correlation, is proposed to improve the binary mask generated by the DIDL network. Benefiting from the scene independence of the temporal distribution and the accuracy improvement resulting from the spatial correlation, the proposed approach performs well for almost all videos from diverse and complex natural scenes with fixed parameters. Comprehensive experiments on standard datasets including LASIESTA, CDNet2014, BMC, SBMI2015, and 128 real-world videos demonstrate the superiority of the proposed approach over state-of-the-art methods with or without deep learning networks. To the best of our knowledge, this work has high potential to be a general solution for moving object segmentation in real-world environments. The code and real-world videos can be found on GitHub: https://github.com/guanfangdong/LTS-UniverisalMOS.
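As a rough illustration of what "the distribution of temporal pixels" could look like as an input representation, the sketch below builds a normalized intensity histogram over time for every pixel of a grayscale clip. This is not the DIDL network: the function name `temporal_histograms`, the histogram representation, and the bin count are assumptions made only for this example.

```python
import numpy as np


def temporal_histograms(video, bins=8):
    """Per-pixel histogram of intensities over time (illustrative only).

    video: array of shape (T, H, W) with values in [0, 1].
    Returns an array of shape (H, W, bins); each histogram sums to 1.
    """
    t, h, w = video.shape
    # Map every intensity to a bin index in [0, bins - 1].
    idx = np.clip((video * bins).astype(int), 0, bins - 1)
    hist = np.zeros((h, w, bins), dtype=float)
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    for frame in idx:
        # Each (row, col, bin) triple is unique within one frame,
        # so in-place accumulation through fancy indexing is safe.
        hist[rows, cols, frame] += 1.0
    return hist / t


# Toy clip: constant background, with one pixel covered by a brighter
# "object" for 5 of the 20 frames.
video = np.full((20, 4, 4), 0.5)
video[5:10, 1, 1] = 0.9
hist = temporal_histograms(video, bins=8)
```

In this toy clip, a background pixel puts all of its mass in a single bin, while the pixel at row 1, column 1 splits its mass between the background bin and a second bin. A learned model would consume distributions of this kind rather than raw scene appearance, which is the intuition behind the scene independence claimed in the abstract.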

Paper Structure (14 sections, 17 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Illustration of the possibility of proposing a universal method for videos from diverse scenes. Although the scene information from different videos is completely different, the distributions of temporal pixels belonging to foreground or background are similar.
  • Figure 2: The model structure of the DIDL network. N: batch size, Filters: number of learned kernels, Conv: convolutional layers, in: number of input channels, out: number of output channels, FC: fully connected layers. As some of the training data contain shadows and boundaries, the final output of the model is $N \times 3$, where black, white, and gray represent the background label, foreground label, and other labels, respectively.
  • Figure 5: The framework of the Stochastic Bayesian Refinement network. To generate a refined foreground, the DIDL foreground output is merged with the corresponding image into a $4 \times H \times W$ tensor. The tensor is then sampled $n$ times at varying scales to produce refined foreground patches, which are subsequently stacked to generate a heatmap. A threshold applied to this heatmap determines the refinement outcome.
  • Figure 6: The structure of the Refine Block in the SBR network. We use a $64\times64$ patch as an example; the block can be applied to any patch size larger than $8\times8$. DoubleConv represents a block consisting of two convolutional layers combined with a ReLU activation layer. The numbers on the left and right of the arrows $\rightarrow$ represent the number of input and output channels, respectively. Concat represents concatenation.
  • Figure 7: Sample results from our 128 newly captured videos. The segmented moving objects are highlighted in red. The video results can be found in the supplementary materials and on YouTube at https://youtu.be/BcLnNTne-n0. The scene information in the newly captured videos is highly complex, and the moving objects vary widely in both type and quantity.
  • ...and 1 more figure
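The sample-stack-threshold flow described in the Figure 5 caption can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' SBR network: `refine_by_random_patches` and `identity_refiner` are hypothetical names, the refiner is a placeholder that returns the input mask unchanged, and the channel order, the patch sizes, the use of averaging to combine overlapping patches, and the 0.5 threshold are all choices made for illustration.

```python
import numpy as np


def identity_refiner(patch):
    """Placeholder for a learned refine block: returns the mask channel."""
    return patch[3]


def refine_by_random_patches(image, mask, refine_patch, n=50,
                             sizes=(8, 16, 32), threshold=0.5, seed=0):
    """image: (3, H, W) array, mask: (H, W) array in [0, 1].

    Returns (heatmap, binary_mask), both of shape (H, W).
    """
    rng = np.random.default_rng(seed)
    _, h, w = image.shape
    # Merge image and mask into one 4 x H x W tensor (mask as last channel).
    x = np.concatenate([image, mask[None]], axis=0)
    votes = np.zeros((h, w))
    counts = np.zeros((h, w))
    for _ in range(n):
        # Pick a random square patch at a random scale and position.
        s = min(int(rng.choice(sizes)), h, w)
        top = int(rng.integers(0, h - s + 1))
        left = int(rng.integers(0, w - s + 1))
        patch = x[:, top:top + s, left:left + s]
        votes[top:top + s, left:left + s] += refine_patch(patch)
        counts[top:top + s, left:left + s] += 1
    # Average overlapping predictions; unsampled pixels keep the input mask.
    heatmap = np.where(counts > 0, votes / np.maximum(counts, 1), mask)
    return heatmap, heatmap > threshold


image = np.zeros((3, 32, 32))
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0
heatmap, refined = refine_by_random_patches(image, mask, identity_refiner, n=20)
```

With the identity placeholder, the heatmap simply reproduces the input mask, which makes the aggregation logic easy to check. Replacing `identity_refiner` with a trained patch-level model is where any actual refinement would come from.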