Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving
Serin Varghese, Kevin Ross, Fabian Hueger, Kira Maag
TL;DR
The paper tackles temporal inconsistency in video semantic segmentation for automated driving by introducing Spatio-Temporal Attention (STA), which integrates multi-frame context directly into transformer attention. STA lets the current-frame query attend to keys and values aggregated from previous frames using a decay factor, enabling robust cross-frame feature fusion with minimal architectural disruption. Across Cityscapes and BDD100k, STA applied to SegFormer and UMixFormer backbones improves temporal consistency (mTC) by up to 9.20 percentage points and mean IoU (mIoU) by up to 1.76 percentage points; ablations identify $T=3$ as optimal, at a modest FLOPs overhead of ~18–23%. The approach is practical for real-time autonomous driving because it is generally applicable, scales across model sizes, and avoids explicit optical-flow estimation, marking a meaningful step toward unified spatio-temporal reasoning in vision transformers.
Abstract
Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of up to 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
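To make the idea of attending across frames concrete, the following is a minimal sketch, not the authors' implementation. It assumes that keys and values from the current and previous frames are concatenated along the token axis, and that the decay factor scales the unnormalized attention weight of a token that is k frames old by `decay**k` (implemented as an additive log-bias on the logits). The function name `spatio_temporal_attention`, the `decay` parameter, and this particular way of applying the decay are illustrative assumptions; the text above does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatio_temporal_attention(query, frames_kv, decay=0.5):
    """Attend from current-frame queries to keys/values of several frames.

    query:     list of N query vectors (lists of d floats) from the current frame.
    frames_kv: list of (keys, values) pairs ordered newest first, so
               frames_kv[0] is the current frame and frames_kv[k] is k frames old.
    decay:     hypothetical per-frame decay in (0, 1]; a token that is k frames
               old has its unnormalized attention weight scaled by decay**k.
    """
    d = len(query[0])
    scale = 1.0 / math.sqrt(d)

    # Concatenate keys/values over time and remember each token's age.
    keys, values, ages = [], [], []
    for age, (k_t, v_t) in enumerate(frames_kv):
        keys.extend(k_t)
        values.extend(v_t)
        ages.extend([age] * len(k_t))

    outputs = []
    for q in query:
        # Scaled dot-product logits plus a log-decay bias per token age.
        logits = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) + age * math.log(decay)
            for k, age in zip(keys, ages)
        ]
        weights = softmax(logits)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

With `decay=1.0` and a single entry in `frames_kv`, this reduces to ordinary single-frame scaled dot-product attention, which is one way to read the claim that the change to existing architectures is minimal. Passing three entries in `frames_kv` would be one possible reading of the $T=3$ setting, although whether $T$ counts the current frame is not stated here.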
