Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving

Serin Varghese, Kevin Ross, Fabian Hueger, Kira Maag

TL;DR

The paper tackles temporal inconsistency in video semantic segmentation for automated driving by introducing Spatio-Temporal Attention (STA), which integrates multi-frame context directly into transformer attention. STA pairs the current-frame query with keys and values aggregated from previous frames using a decay factor, which enables cross-frame feature fusion with minimal architectural changes. Across Cityscapes and BDD100k, STA applied to SegFormer and UMixFormer backbones improves temporal consistency (mTC) by up to 9.20 percentage points and mean IoU (mIoU) by up to 1.76 percentage points. Ablations identify $T=3$ as the optimal temporal context length, and the FLOPs overhead is roughly 18–23%. The approach is practical for real-time automated driving because it is generally applicable, scales across model sizes, and avoids explicit optical-flow estimation.
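
The mechanism summarized above can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: the learned query/key/value projections are replaced by identity maps, and the decay factor is assumed to act as a per-frame bias of `age * log(decay)` on the attention logits, which is only one plausible reading of "using a decay factor". The function name `sta_attention` and its parameters are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sta_attention(frames, decay=0.5):
    """Toy spatio-temporal attention for one head.

    frames: list of T frames, oldest first; each frame is a list of
            N token vectors (lists of d floats).
    decay:  value in (0, 1]; ASSUMPTION: a frame that is `age` steps old
            gets the bias age * log(decay) added to its logits.
    Queries come from the current (last) frame only; keys and values come
    from every frame in the window. Projections are identity (simplification).
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    current = frames[-1]
    d = len(current[0])
    keys, biases = [], []
    for age, frame in enumerate(reversed(frames)):  # age 0 = current frame
        for token in frame:
            keys.append(token)
            biases.append(age * math.log(decay))
    values = keys  # identity value projection in this sketch
    output = []
    for q in current:
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + bias
            for k, bias in zip(keys, biases)
        ]
        weights = softmax(logits)
        output.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
        )
    return output

# With a single frame (T = 1) this reduces to ordinary self-attention.
single = sta_attention([[[1.0, 0.0], [0.0, 1.0]]])
```

Under this assumption, the output keeps the token count of the current frame regardless of the window length, so the rest of a single-frame architecture would not need to change shape.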

Abstract

Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
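
For readers less familiar with the accuracy metric quoted above, the following is a minimal sketch of mean intersection over union for a single pair of label maps. Benchmarks such as Cityscapes usually accumulate intersections and unions over the whole dataset before averaging, so this per-image version is a simplification. The temporal consistency metric (mTC) is not defined in this summary and is therefore not sketched. The function name `mean_iou` is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU between two flat lists of class ids of equal length.

    Classes that appear in neither map are skipped so they do not
    distort the average (a common convention, assumed here).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

# Class 0: IoU 1/2, class 1: IoU 2/3, mean = 7/12
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

A gain of 1.76 percentage points in this metric means, for example, moving from a score of 0.7000 to 0.7176, not a relative improvement of 1.76%.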

Paper Structure (14 sections, 9 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Examples of stable and unstable predictions. The yellow box highlights the area of interest in the images. Top: From left to right, we have three consecutive frames of a video sequence from $t-2$ to $t$. Center: The predictions of a semantic segmentation model without our spatio-temporal attention module (STA), where the motorcycle and bicycles are not consistent over time (both $t-2 \rightarrow t-1$ and $t-1 \rightarrow t$). Bottom: STA focuses on improving temporal consistency of predictions of semantic segmentation networks over time. With our approach we observe an improvement in the robustness of the prediction in the highlighted area.
  • Figure 2: Hierarchical illustration of our proposed Spatio-Temporal Attention (STA) module. The figure shows three levels of detail. Left: Transformer-based segmentation architecture with MiT encoder stages processing multi-scale features. Center: Individual transformer block with Multi-Head Self-Attention (MSA) containing multiple STA-Heads, followed by Mix-FFN. Right: Detailed STA computation that extends standard attention across temporal sequence $\{\mathbf{x}_{t-T+1}, \ldots, \mathbf{x}_t\}$, enabling cross-frame feature aggregation while preserving spatial relationships.
  • Figure 3: Temporal context ablation study showing performance of STA-UMixFormer B0 on the Cityscapes dataset across different temporal context lengths $T$. The value of $T=1$ corresponds to the single frame procedure.
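
The temporal sequence $\{\mathbf{x}_{t-T+1}, \ldots, \mathbf{x}_t\}$ named in the Figure 2 caption suggests a sliding window over the most recent $T$ frames. A minimal sketch of such a window is shown below, assuming features from earlier frames are simply cached and reused; the class name `FrameWindow` and its interface are hypothetical and not taken from the paper.

```python
from collections import deque

class FrameWindow:
    """Keeps the features of the most recent T frames, oldest first."""

    def __init__(self, T):
        if T < 1:
            raise ValueError("T must be at least 1")
        # deque with maxlen drops the oldest entry automatically
        self._buffer = deque(maxlen=T)

    def push(self, features):
        """Add the current frame's features and return the window.

        The returned list ends with the current frame. At the start of a
        video it holds fewer than T entries.
        """
        self._buffer.append(features)
        return list(self._buffer)

window = FrameWindow(T=3)
for name in ["x0", "x1", "x2", "x3"]:
    context = window.push(name)
# context is now ["x1", "x2", "x3"]
```

With `T=1` the window always contains only the current frame, which matches the single-frame case noted in the Figure 3 caption.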