Exploiting Temporal State Space Sharing for Video Semantic Segmentation

Syed Ariff Syed Hesham, Yun Liu, Guolei Sun, Henghui Ding, Jing Yang, Ender Konukoglu, Xue Geng, Xudong Jiang

TL;DR

The paper tackles video semantic segmentation (VSS) by addressing the limited temporal context and high memory costs of frame-based or short-window approaches. It introduces Temporal Video State Space Sharing (TV3S), which uses Vision Mamba-based state space models to propagate temporal information across frames while processing spatial patches in parallel, with a shifted window mechanism to capture motion across patch boundaries. Key contributions include a TV3S block design with two TSS modules (unshifted and shifted), a patch-based parallel processing paradigm, and a training/inference strategy that preserves long-range temporal coherence with reduced resource demands. Empirical results on VSPW and Cityscapes show state-of-the-art or near-state-of-the-art performance with favorable efficiency, positioning TV3S as a practical advance for scalable, temporally aware VSS.

Abstract

Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating a shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S represents a significant advance in spatiotemporal modeling, paving the way for efficient video analysis. The code is publicly available at https://github.com/Ashesham/TV3S.git.
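
The mechanism described in the abstract can be illustrated with a toy sketch. The pure-Python example below is not the authors' implementation and does not use Mamba's actual parameterization. It assumes scalar per-pixel features, a mean-pooled patch summary, and a single sigmoid gate (`w_gate`), all of which are hypothetical simplifications, as are the function names `partition` and `temporal_scan`. It shows only two ideas: each spatial patch carries its own hidden state from frame to frame, so no pool of past frame features is stored, and a cyclic shift of the patch grid lets pixels on either side of an unshifted patch boundary share a state.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def partition(frame, window, shift=0):
    """Group (row, col) cells into square patches of side `window`.

    `shift` cyclically offsets the grid, so cells that straddle an
    unshifted patch boundary can land in the same patch.
    """
    h, w = len(frame), len(frame[0])
    patches = {}
    for r in range(h):
        for c in range(w):
            key = (((r + shift) % h) // window, ((c + shift) % w) // window)
            patches.setdefault(key, []).append((r, c))
    return patches


def temporal_scan(frames, window=2, shift=0, w_gate=1.0):
    """Toy gated recurrence over time, run independently for each patch.

    Only one hidden scalar per patch is carried between frames.
    The gate depends on the current input, a stand-in for selective gating.
    """
    state = {}  # patch id -> hidden scalar carried across frames
    outputs = []
    for frame in frames:
        out = [[0.0] * len(frame[0]) for _ in frame]
        for key, cells in partition(frame, window, shift).items():
            x = sum(frame[r][c] for r, c in cells) / len(cells)  # patch summary
            g = sigmoid(w_gate * x)                               # input-dependent gate
            h_new = (1.0 - g) * state.get(key, 0.0) + g * x       # state update
            state[key] = h_new
            for r, c in cells:
                out[r][c] = frame[r][c] + h_new                   # residual output
        outputs.append(out)
    return outputs, state


if __name__ == "__main__":
    # Three constant 4x4 frames; each of the four 2x2 patches keeps one state.
    frames = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
    outs, state = temporal_scan(frames, window=2, shift=0)
    print(len(outs), sorted(state))
```

Chaining an unshifted pass with a shifted pass (for example, calling `temporal_scan(outs, window=2, shift=1)` on the outputs of the first pass) loosely mirrors the pairing of unshifted and shifted TSS modules mentioned in the TL;DR. How the real model composes them is specified in the paper and repository, not here.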

Paper Structure

This paper contains 16 sections, 4 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of the proposed TV3S with baseline models for VSS. By enhancing temporal information, our TV3S demonstrates superior performance over the baselines.
  • Figure 2: Overview of the proposed TV3S architecture, illustrating the encoder-decoder framework that employs state space models and our Mamba-based TSS module for independent spatial and temporal processing.
  • Figure 3: Internal structure of the TV3S block, illustrating the flow of internal operations, and the propagation of hidden states for efficient spatiotemporal integration.
  • Figure 4: Qualitative comparison of our TV3S architecture with the current baseline. TV3S produces more accurate spatial predictions and uses temporal information to yield temporally consistent segmentation results.