Global Average Feature Augmentation for Robust Semantic Segmentation with Transformers

Alberto Gonzalo Rodriguez Salgado, Maying Shen, Philipp Harzig, Peter Mayer, Jose M. Alvarez

TL;DR

This work tackles the robustness of Vision Transformers for semantic segmentation under corruptions by introducing Channel-Wise Feature Augmentation (CWFA), a training-time augmentation that perturbs encoder features using a globally estimated channel-wise perturbation derived from a global average feature. CWFA is a lightweight, plug-in mechanism that computes $\boldsymbol{p} = \boldsymbol{x}_i / \lVert \boldsymbol{x}_i \rVert_2$ with a perturbation strength $\epsilon$ and applies it across all spatial positions in each channel with probability $p_{\text{augm}}$, leveraging the global attention in ViTs to capture global perturbations efficiently. Empirically, CWFA yields substantial robustness gains across SegFormer, Swin, and Twins on Cityscapes and ADE20K, achieving up to 27.7% mIoU improvement under impulse noise for SegFormer-B1 and setting a new state-of-the-art 84.3% retention for SegFormer-B5 on Cityscapes-C, while incurring only ~2% extra training time and no inference cost increase. The method consistently outperforms image-space augmentations like AugMix and feature-space baselines such as SFA, demonstrating strong generalization across architectures and datasets and offering a practical, scalable path to robust semantic segmentation with Transformers.
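
The augmentation step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes that the vector being normalised is the global average feature (the mean over all spatial positions), that the perturbation is simply added with strength `eps`, and that features are stored as a plain list of token vectors. The function name `cwfa_perturb` and its default values are hypothetical.

```python
import math
import random


def cwfa_perturb(features, eps=0.1, p_augm=0.5, rng=random):
    """Sketch of one channel-wise feature augmentation step.

    features: list of N token vectors, each a list of C floats
              (the flattened spatial positions of one encoder output).
    Returns a new list with the same shape.
    """
    # Training-time only: skip the augmentation with probability 1 - p_augm.
    if rng.random() >= p_augm:
        return [row[:] for row in features]

    n_tokens = len(features)
    n_channels = len(features[0])

    # Global average feature: per-channel mean over all spatial positions.
    avg = [sum(row[c] for row in features) / n_tokens for c in range(n_channels)]

    # Assumed reading of p = x / ||x||_2: normalise to unit L2 norm.
    norm = math.sqrt(sum(v * v for v in avg))
    if norm == 0.0:
        return [row[:] for row in features]
    direction = [v / norm for v in avg]

    # Assumption: the same per-channel offset eps * p is added at every position.
    return [
        [row[c] + eps * direction[c] for c in range(n_channels)]
        for row in features
    ]


# Toy input: two spatial positions, two channels.
feats = [[1.0, 2.0], [3.0, 2.0]]
augmented = cwfa_perturb(feats, eps=0.5, p_augm=1.0)
```

Because a single offset per channel is shared by all positions, the extra work is one mean and one norm per encoder output, which is in line with the small training overhead reported above. Nothing runs at inference time, since the step is only applied during training.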

Abstract

Robustness to out-of-distribution data is crucial for deploying modern neural networks. Recently, Vision Transformers, such as SegFormer for semantic segmentation, have shown impressive robustness to visual corruptions like blur or noise affecting the acquisition device. In this paper, we propose Channel-Wise Feature Augmentation (CWFA), a simple yet efficient feature augmentation technique to improve the robustness of Vision Transformers for semantic segmentation. CWFA applies a globally estimated perturbation per encoder with minimal compute overhead during training. Extensive evaluations on Cityscapes and ADE20K with three state-of-the-art Vision Transformer architectures (SegFormer, Swin Transformer, and Twins) demonstrate that CWFA-enhanced models significantly improve robustness without affecting clean data performance. For instance, on Cityscapes, a CWFA-augmented SegFormer-B1 model yields up to 27.7% mIoU robustness gain on impulse noise compared to the non-augmented SegFormer-B1. Furthermore, CWFA-augmented SegFormer-B5 achieves a new state-of-the-art 84.3% retention rate, a 0.7% improvement over the recently published FAN+STL.
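
The retention rate quoted above is not defined in this summary. The sketch below assumes the common definition, namely the mean mIoU over corrupted inputs divided by the mIoU on clean inputs, expressed in percent. The function name and the input numbers are hypothetical and are not results from the paper.

```python
def retention_rate(clean_miou, corrupted_mious):
    """Mean corrupted mIoU as a percentage of clean mIoU (assumed definition)."""
    mean_corrupted = sum(corrupted_mious) / len(corrupted_mious)
    return 100.0 * mean_corrupted / clean_miou


# Hypothetical values for illustration only (not taken from the paper):
# one clean mIoU and the mIoU under three corruption types.
example = retention_rate(80.0, [70.0, 60.0, 50.0])
```

Under this definition, a model that loses little accuracy under corruption scores close to 100%, regardless of how high its clean mIoU is.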

Paper Structure

This paper contains 14 sections, 3 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: Left: We propose CWFA, a feature augmentation module for Vision Transformers. CWFA computes a feature perturbation based on a global average feature rather than independent perturbations for each feature. Right: Compared to the baseline SegFormer and other CNN and Transformer models, our approach consistently outperforms them independently of the model size and yields improvements of up to 27%. Our results with large models (SegFormer-B5+CWFA) set a new state-of-the-art retention ratio for semantic segmentation.
  • Figure 2: Global Average Feature Augmentation
  • Figure 3: Example results. Our method shows robustness improvements compared to existing approaches.
  • Figure 4: Sensitivity of SegFormer baseline models towards CWFA and choice of $\epsilon$. Sensitivity of SegFormer models when perturbing the features with different $\epsilon$ values during inference on the Cityscapes validation set. a) Sensitivity as a function of the model size when perturbing the features from the first encoder of the original models. b) Sensitivity of SegFormer-B0 as a function of the encoder for the original model and a model fine-tuned with CWFA.
  • Figure 5: Robustness as a function of $\epsilon$. Sensitivity of SegFormer-B0 fine-tuned using CWFA for different perturbation strengths. Evaluation on City-C. Our approach is not very sensitive to the choice of $\epsilon$, as we obtain similar robustness gains when choosing $\epsilon$ in a wide range of values.
  • ...and 3 more figures