Condition-Invariant Semantic Segmentation
Christos Sakaridis, David Bruggemann, Fisher Yu, Luc Van Gool
TL;DR
This work tackles semantic segmentation under condition-level domain shifts, where changes in appearance (not in scene structure) degrade performance. It proposes Condition-Invariant Semantic Segmentation (CISS), which uses shallow stylization to create cross-domain views of each image and introduces a feature invariance loss that aligns encoder features across those views, so that the decoder can rely on stable representations. Empirically, CISS achieves state-of-the-art results on Cityscapes→Dark Zurich and the second-best results on Cityscapes→ACDC. It also generalizes well, without further adaptation, to unseen nighttime datasets such as BDD100K-night and ACDC-night. The approach demonstrates that internal feature alignment, coupled with stylization-based data augmentation, yields robust condition-invariant representations and improves generalization across diverse visual conditions. The method is generally applicable across stylization techniques and architectures, offering a practical path to reliable perception in autonomous systems under varying environmental conditions.
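A "shallow" stylization changes the global appearance of an image while keeping its layout. The sketch below illustrates one way to produce such a cross-domain view: Fourier amplitude swapping in the spirit of FDA (Fourier Domain Adaptation, Yang and Soatto, CVPR 2020). This technique is swapped in for illustration and is not necessarily the stylization that CISS uses. The function name `fourier_stylize`, its window parameter `beta`, and the random stand-in images are hypothetical.

```python
import numpy as np


def fourier_stylize(content: np.ndarray, style: np.ndarray, beta: float = 0.01) -> np.ndarray:
    """Hypothetical FDA-style stylization (illustrative, not the CISS code).

    Replaces the low-frequency FFT amplitude of `content` with that of `style`
    while keeping the phase of `content`, which carries most scene structure.
    Both inputs are float arrays of shape (H, W, C) with identical shapes.
    """
    if content.shape != style.shape:
        raise ValueError("content and style must have the same shape")
    h, w = content.shape[:2]
    # Per-channel 2-D FFT over the two spatial axes.
    f_content = np.fft.fft2(content, axes=(0, 1))
    f_style = np.fft.fft2(style, axes=(0, 1))
    amp_c, phase_c = np.abs(f_content), np.angle(f_content)
    amp_s = np.abs(f_style)
    # Centre the zero frequency so the low frequencies form a central window.
    amp_c = np.fft.fftshift(amp_c, axes=(0, 1))
    amp_s = np.fft.fftshift(amp_s, axes=(0, 1))
    b = int(np.floor(min(h, w) * beta))  # half-size of the swapped window
    ch, cw = h // 2, w // 2
    win = (slice(ch - b, ch + b + 1), slice(cw - b, cw + b + 1))
    amp_c[win] = amp_s[win]
    amp_c = np.fft.ifftshift(amp_c, axes=(0, 1))
    # Recombine mixed amplitude with the original phase and invert the FFT.
    stylized = np.fft.ifft2(amp_c * np.exp(1j * phase_c), axes=(0, 1))
    return np.real(stylized)


rng = np.random.default_rng(0)
day = rng.random((32, 32, 3))          # stand-in for a daytime image
night = 0.2 * rng.random((32, 32, 3))  # darker stand-in reference image
view = fourier_stylize(day, night, beta=0.0)
```

Suppose both images are non-negative and `beta=0.0`. Then only the zero-frequency (DC) term is swapped. This moves each channel's mean to that of the reference and leaves all spatial detail unchanged. Larger `beta` values transfer progressively more low-frequency appearance.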
Abstract
Adaptation of semantic segmentation networks to different visual conditions is vital for robust perception in autonomous cars and robots. However, previous work has shown that most feature-level adaptation methods, which employ adversarial training and are validated on synthetic-to-real adaptation, provide only marginal gains in condition-level adaptation and are outperformed by simple pixel-level adaptation via stylization. Motivated by these findings, we propose to leverage stylization for feature-level adaptation. A novel feature invariance loss aligns the internal features that the network's encoder extracts from the original view and from the stylized view of each input image. In this way, we encourage the encoder to extract features that are already invariant to the style of the input. The decoder can then focus on parsing these features rather than on further abstracting away the specific style of the input. We implement our method, named Condition-Invariant Semantic Segmentation (CISS), on the current state-of-the-art domain adaptation architecture and achieve excellent results on condition-level adaptation. In particular, CISS sets the new state of the art on the popular daytime-to-nighttime Cityscapes→Dark Zurich benchmark. Furthermore, our method achieves the second-best performance on the normal-to-adverse Cityscapes→ACDC benchmark. CISS is also shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night. Code is publicly available at https://github.com/SysCV/CISS.
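The feature invariance loss described above can be sketched as follows. The sketch assumes the loss is a mean squared difference between matching encoder feature maps of the two views, averaged over encoder stages, and that it is added to the segmentation loss with a weight `lam`. All of these are assumptions; the exact form used by CISS may differ. The names `feature_invariance_loss`, `toy_encoder`, and `total_loss` are hypothetical. `toy_encoder` is a stand-in for a real deep encoder, and a global per-channel shift stands in for stylization.

```python
import numpy as np


def feature_invariance_loss(feats_orig, feats_styl):
    """Hypothetical invariance loss: MSE between matching encoder feature maps
    of the original and stylized views, averaged over encoder stages."""
    if len(feats_orig) != len(feats_styl):
        raise ValueError("both views must yield the same number of stages")
    per_stage = []
    for f_o, f_s in zip(feats_orig, feats_styl):
        if f_o.shape != f_s.shape:
            raise ValueError("feature maps must have matching shapes")
        per_stage.append(np.mean((f_o - f_s) ** 2))
    return float(np.mean(per_stage))


def toy_encoder(image, normalize=True):
    """Hypothetical two-stage encoder for an (H, W, C) image with even H and W."""
    x = image
    if normalize:
        # Per-channel instance normalization removes global colour/brightness shifts.
        x = (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-6)
    stage1 = x
    h, w, c = x.shape
    # Stage 2: 2x2 average pooling over the spatial axes.
    stage2 = x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    return [stage1, stage2]


def total_loss(seg_loss, feats_orig, feats_styl, lam=1.0):
    """Segmentation loss plus the weighted invariance term (lam is an assumed weight)."""
    return seg_loss + lam * feature_invariance_loss(feats_orig, feats_styl)


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
stylized = image + np.array([0.3, -0.1, 0.2])  # shallow, global per-channel shift

inv_norm = feature_invariance_loss(toy_encoder(image), toy_encoder(stylized))
inv_raw = feature_invariance_loss(toy_encoder(image, normalize=False),
                                  toy_encoder(stylized, normalize=False))
print(f"normalized encoder: {inv_norm:.2e}, raw encoder: {inv_raw:.4f}")
```

The toy encoder with instance normalization already produces style-invariant features for this simple shift, so its invariance loss is essentially zero. The raw encoder passes the shift through to every stage and is penalized. In CISS, the encoder is a deep network that has to learn such invariance through the loss, and the stylization is richer than a global shift. Whether gradients flow through both views is a design choice this sketch does not address.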
