Condition-Invariant Semantic Segmentation

Christos Sakaridis, David Bruggemann, Fisher Yu, Luc Van Gool

TL;DR

This work tackles semantic segmentation under condition-level domain shifts, where changes in appearance rather than scene structure degrade performance. It proposes Condition-Invariant Semantic Segmentation (CISS), which uses shallow stylization to render each image in the style of the other domain and introduces a feature invariance loss that aligns the encoder features of the original and stylized views, so that the decoder can parse features that are already invariant to style. Empirically, CISS sets the state of the art on Cityscapes→Dark Zurich and achieves the second-best performance on Cityscapes→ACDC, and it generalizes well to domains unseen during training, such as BDD100K-night and ACDC-night. These results indicate that internal feature alignment, coupled with stylization, yields condition-invariant representations that generalize across diverse visual conditions. The method is generally applicable across stylization techniques and architectures, offering a practical path to reliable perception in autonomous systems under varying environmental conditions.

Abstract

Adaptation of semantic segmentation networks to different visual conditions is vital for robust perception in autonomous cars and robots. However, previous work has shown that most feature-level adaptation methods, which employ adversarial training and are validated on synthetic-to-real adaptation, provide marginal gains in condition-level adaptation, being outperformed by simple pixel-level adaptation via stylization. Motivated by these findings, we propose to leverage stylization in performing feature-level adaptation by aligning the internal network features extracted by the encoder of the network from the original and the stylized view of each input image with a novel feature invariance loss. In this way, we encourage the encoder to extract features that are already invariant to the style of the input, allowing the decoder to focus on parsing these features and not on further abstracting from the specific style of the input. We implement our method, named Condition-Invariant Semantic Segmentation (CISS), on the current state-of-the-art domain adaptation architecture and achieve outstanding results on condition-level adaptation. In particular, CISS sets the new state of the art in the popular daytime-to-nighttime Cityscapes$\to$Dark Zurich benchmark. Furthermore, our method achieves the second-best performance on the normal-to-adverse Cityscapes$\to$ACDC benchmark. CISS is shown to generalize well to domains unseen during training, such as BDD100K-night and ACDC-night. Code is publicly available at https://github.com/SysCV/CISS .
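
The feature invariance loss described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `feature_invariance_loss`, the averaging over all feature elements, and the example weight `lambda_s = 0.1` are assumptions made here. Only the choice between a squared Frobenius norm and an $L_1$ norm comes from the paper's ablations.

```python
import numpy as np


def feature_invariance_loss(feat_orig, feat_styl, norm="frobenius_sq"):
    """Discrepancy between encoder features of two views of the same image.

    feat_orig, feat_styl: arrays of identical shape, e.g. (C, H, W).
    The division by the number of elements is an assumption of this sketch.
    """
    feat_orig = np.asarray(feat_orig, dtype=np.float64)
    feat_styl = np.asarray(feat_styl, dtype=np.float64)
    if feat_orig.shape != feat_styl.shape:
        raise ValueError("feature maps must have identical shapes")
    diff = feat_orig - feat_styl
    if norm == "frobenius_sq":
        # Squared Frobenius norm, averaged over all feature elements.
        return float(np.sum(diff ** 2) / diff.size)
    if norm == "l1":
        # L1 alternative, averaged over all feature elements.
        return float(np.sum(np.abs(diff)) / diff.size)
    raise ValueError(f"unknown norm: {norm}")


# Hypothetical usage: random arrays stand in for encoder features of a
# source image and of its target-stylized view.
rng = np.random.default_rng(0)
feat_source = rng.normal(size=(8, 4, 4))
feat_source_stylized = feat_source + 0.1 * rng.normal(size=(8, 4, 4))

lambda_s = 0.1  # illustrative loss weight, not a value from the paper
cross_entropy = 1.0  # placeholder for the segmentation loss
objective = cross_entropy + lambda_s * feature_invariance_loss(
    feat_source, feat_source_stylized
)
```

The loss is zero only when the two views produce identical features, so minimizing it pushes the encoder toward representations that do not depend on the style of the input.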

Paper Structure (25 sections, 10 equations, 6 figures, 11 tables)

This paper contains 25 sections, 10 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: The domain shift from normal to adverse conditions presents challenges to state-of-the-art domain adaptation methods for semantic segmentation, such as HRDA, due to the large resulting change in the appearance of classes. We propose a method that encourages invariance of internal features of segmentation networks to visual conditions by comparing features of different views of the same scene under the style of different domains, improving segmentation especially for classes which undergo large shifts.
  • Figure 2: Overview of our method. Two instances of a shallow stylization mapping $g$ are fed with the source and target image, $I_s$ and $I_t$, to produce versions stylized with the converse domain, $I_{s\to{}t}$ and $I_{t\to{}s}$. In this example, $I_{s\to{}t}$ and $I_{t\to{}s}$ are computed using FDA \cite{fda:adaptation}. The four images are fed to a shared encoder $\phi$, the features of which are used to compute our feature invariance losses. The features of the original source and target images are further fed to a shared decoder $\omega$ to compute softmax predictions and respective cross-entropy losses. Double lines indicate shared weights.
  • Figure 3: Qualitative results on Cityscapes$\to{}$ACDC. From left to right: ACDC image, ground-truth annotation, HRDA \cite{hrda:domain:adaptation}, and CISS. Best viewed on a screen and zoomed in.
  • Figure 4: Ablation of the point in the network where invariance is applied on Cityscapes$\to$ACDC. Evaluation is performed on the validation set of ACDC. The $x$-axis is logarithmic and shows the weight $\lambda_s$ of the feature invariance loss, which is applied here only on the source domain. Averages and standard deviations are plotted over three runs for each configuration. The two plotted lines share their leftmost point, which corresponds to $\lambda_s = 0$, i.e., not applying an invariance loss at all.
  • Figure 5: Ablation of the norm which is used in the feature invariance loss on Cityscapes$\to$ACDC. Evaluation is performed on the validation set of ACDC. The $x$-axis is logarithmic and shows the weight $\lambda_s$ of the feature invariance loss, which is applied here only on the source domain. Averages and standard deviations are plotted over three runs for each configuration. Results with the proposed, squared Frobenius norm are plotted in blue and those with the alternative, $L_1$ norm are plotted in red. The two plotted lines share their leftmost point, which corresponds to $\lambda_s = 0$, i.e., not applying an invariance loss at all.
  • ...and 1 more figure
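
The shallow stylization mapping $g$ in Figure 2 is instantiated with FDA in the example shown. As a rough sketch of that kind of Fourier-domain stylization, the code below replaces the low-frequency amplitude spectrum of a content image with that of a style image while keeping the content phase. The function name `fda_stylize`, the parameter `beta`, and its default value are assumptions of this sketch and are not taken from the CISS code base.

```python
import numpy as np


def fda_stylize(content, style, beta=0.01):
    """Swap the low-frequency Fourier amplitude of `content` with `style`'s.

    content, style: arrays of identical shape, (H, W) or (H, W, C).
    beta: fraction of the shorter image side that sets the half-width of
          the swapped low-frequency window (hypothetical parameter name).
    """
    content = np.asarray(content, dtype=np.float64)
    style = np.asarray(style, dtype=np.float64)
    if content.shape != style.shape:
        raise ValueError("content and style must have identical shapes")
    if not 0.0 <= beta <= 0.5:
        raise ValueError("beta must lie in [0, 0.5]")

    # 2D FFT over the spatial axes; channels, if present, are kept separate.
    fft_c = np.fft.fft2(content, axes=(0, 1))
    fft_s = np.fft.fft2(style, axes=(0, 1))
    amp_c, phase_c = np.abs(fft_c), np.angle(fft_c)
    amp_s = np.abs(fft_s)

    # Move the zero frequency to the center so the window is contiguous.
    amp_c = np.fft.fftshift(amp_c, axes=(0, 1))
    amp_s = np.fft.fftshift(amp_s, axes=(0, 1))

    h, w = content.shape[:2]
    b = int(np.floor(min(h, w) * beta))
    ch, cw = h // 2, w // 2
    amp_c[ch - b:ch + b + 1, cw - b:cw + b + 1] = \
        amp_s[ch - b:ch + b + 1, cw - b:cw + b + 1]

    # Undo the shift, recombine with the content phase, and invert.
    amp_c = np.fft.ifftshift(amp_c, axes=(0, 1))
    stylized = np.fft.ifft2(amp_c * np.exp(1j * phase_c), axes=(0, 1))
    return np.real(stylized)


# Hypothetical usage: random arrays stand in for a source and a target image.
rng = np.random.default_rng(0)
image_s = rng.uniform(size=(32, 32))
image_t = 0.2 * rng.uniform(size=(32, 32))  # darker "target" image
image_s_to_t = fda_stylize(image_s, image_t, beta=0.05)
```

Because only low-frequency amplitudes are exchanged, global appearance such as brightness follows the style image while the scene layout encoded in the phase is preserved, which is what makes the original and stylized images two views of the same scene.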