MUSTAN: Multi-scale Temporal Context as Attention for Robust Video Foreground Segmentation

Praveen Kumar Pokala; Jaya Sai Kiran Patibandla; Naveen Kumar Pandey; Balakrishna Reddy Pailla

TL;DR

The paper addresses robust video foreground segmentation (VFS) under out-of-domain conditions by introducing MUSTAN1 and MUSTAN2, two encoder-decoder architectures that incorporate multi-scale temporal context as attention to improve representation learning. It also introduces a new Indoor Surveillance Dataset (ISD) with frame-level foreground, depth, normal, and instance annotations, generated via an Unreal Engine 5-based data pipeline to diversify indoor scenes. Experimental results show that MUSTAN1 and MUSTAN2 achieve superior out-of-domain performance compared to strong baselines on CDnet-2014 and SBI-2015, with MUSTAN2 attaining notable gains and requiring no extra input annotations. The work demonstrates the value of temporal modeling for VFS, offers a practical synthetic dataset for broader computer vision tasks, and points to future exploration of depth cues alongside temporal information.

Abstract

Video foreground segmentation (VFS) is an important computer vision task in which the goal is to segment moving objects from the background. Most current methods are image-based, i.e., they rely only on spatial cues and ignore motion cues. As a result, they tend to overfit the training data and do not generalize well to out-of-domain (OOD) distributions. To address this problem, prior works exploited additional cues such as optical flow and background subtraction masks. However, obtaining video data with annotations such as optical flow is challenging. In this paper, we utilize the temporal information and the spatial cues in video data to improve OOD performance. The challenge lies in how the temporal information is modeled: modeling it in an interpretable way makes a noticeable difference. We therefore devise a strategy that integrates the temporal context of the video into the development of VFS. Our approach gives rise to two deep learning architectures, MUSTAN1 and MUSTAN2, which are based on the idea of multi-scale temporal context as attention, i.e., the temporal context helps our models learn better representations that are beneficial for VFS. Further, we introduce a new video dataset for VFS, the Indoor Surveillance Dataset (ISD). It has multiple frame-level annotations such as the foreground binary mask, depth map, and instance semantic annotations. Therefore, ISD can benefit other computer vision tasks. We validate the efficacy of our architectures and compare their performance with baselines. We demonstrate that the proposed methods significantly outperform the benchmark methods on OOD data. In addition, the performance of MUSTAN2 improves significantly on certain video categories of OOD data due to ISD.
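
To make the phrase "multi-scale temporal context as attention" concrete, the following is a minimal, illustrative sketch in plain Python. It is not the MUSTAN1 or MUSTAN2 architecture: the abstract does not specify how the context is computed, so this sketch assumes a simple stand-in (mean absolute frame difference at a few resolutions) in place of learned context features. All function names and the `num_scales`, `gain`, and `bias` parameters are hypothetical.

```python
import math


def avg_pool2(frame):
    """2x2 average pooling on a 2D list (height and width assumed even)."""
    return [
        [(frame[i][j] + frame[i][j + 1] + frame[i + 1][j] + frame[i + 1][j + 1]) / 4.0
         for j in range(0, len(frame[0]), 2)]
        for i in range(0, len(frame), 2)
    ]


def upsample2(frame):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


def temporal_context(frames):
    """Per-pixel mean absolute difference between consecutive frames.

    Assumption: frame differencing stands in for a learned temporal context.
    Requires at least two frames.
    """
    steps = len(frames) - 1
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sum(abs(frames[t + 1][i][j] - frames[t][i][j]) for t in range(steps)) / steps
         for j in range(width)]
        for i in range(height)
    ]


def multiscale_temporal_attention(frames, num_scales=2, gain=10.0, bias=0.1):
    """Average the temporal context over several scales and squash it to (0, 1).

    Height and width must be divisible by 2 ** (num_scales - 1).
    """
    height, width = len(frames[0]), len(frames[0][0])
    total = [[0.0] * width for _ in range(height)]
    scaled = frames
    for scale in range(num_scales):
        context = temporal_context(scaled)
        # Bring the coarse context back to the original resolution.
        for _ in range(scale):
            context = upsample2(context)
        for i in range(height):
            for j in range(width):
                total[i][j] += context[i][j] / num_scales
        if scale < num_scales - 1:
            scaled = [avg_pool2(f) for f in scaled]
    # Sigmoid turns the averaged context into a soft attention map.
    return [[1.0 / (1.0 + math.exp(-gain * (v - bias))) for v in row] for row in total]


def apply_attention(features, attention):
    """Element-wise gating of a 2D feature map by the attention map."""
    return [[f * a for f, a in zip(frow, arow)] for frow, arow in zip(features, attention)]
```

In this sketch, pixels that change over time at either resolution receive attention values closer to 1, while static background pixels are suppressed. A learned model would replace the frame-difference heuristic with features produced by a network.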

Paper Structure (18 sections, 4 equations, 8 figures, 8 tables)

Introduction
Related Work
Motivation and Contribution
Motivation
Our Contribution
Indoor Surveillance Dataset (ISD)
Data Generation Pipeline
Dataset Description
Proposed Methodology
MUSTAN1
MUSTAN2
Experimental Results and Analysis
Baselines
Datasets
Evaluation Metrics
...and 3 more sections

Figures (8)

Figure 1: ISD pipeline that takes a 3D environment along with assets such as humans as input and outputs a synthetic RGB image with various annotations such as binary foreground mask, normal map, depth map, and instance semantic map.
Figure 2: Visualization of sample backgrounds developed in ISD (top row). Visualization of sample images from ISD with their annotations such as binary foreground mask, normal map, depth map, and instance semantic map (bottom two rows).
Figure 3: MUSTAN1 architecture. BN stands for batch normalization layer, "Conv" denotes convolutional layer, CNet denotes Context Network, FNet stands for Feature Network, FRM stands for Feature Refinement Module, and RLIM denotes Refine Localization Information Module.
Figure 4: Feature Refinement Module (FRM). CONV stands for convolution layer, BN stands for batch normalization layer, $+$ denotes element-wise addition, and $\cdot$ denotes element-wise multiplication.
Figure 5: Refine Localization Information Module (RLIM). LRE stands for low resolution embedding, HRE stands for high resolution embedding.
...and 3 more figures
