Reliability in Semantic Segmentation: Can We Use Synthetic Data?

Thibaut Loiseau, Tuan-Hung Vu, Mickael Chen, Patrick Pérez, Matthieu Cord

TL;DR

The paper addresses the difficulty of validating semantic segmentation models under covariate shifts and on unseen OOD inputs. It introduces a zero-shot synthetic-data framework built on ControlNet and Stable Diffusion, fine-tuned using only in-domain Cityscapes data. The pipeline generates images in OOD domains and inpaints OOD objects without requiring any real OOD data, which allows 40 pretrained segmenters to be evaluated across diverse shifts. Model performance on the synthetic OOD data correlates strongly with performance on real OOD data, and the synthetic data can also be used to improve calibration and OOD detection. The approach offers a scalable, cost-effective route to virtual reliability testing in safety-critical settings, with practical guidance on data requirements and domain coverage.

Abstract

Assessing the robustness of perception models to covariate shifts and their ability to detect out-of-distribution (OOD) inputs is crucial for safety-critical applications such as autonomous vehicles. By nature of such applications, however, the relevant data is difficult to collect and annotate. In this paper, we show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models. By fine-tuning Stable Diffusion with only in-domain data, we perform zero-shot generation of visual scenes in OOD domains or inpainted with OOD objects. This synthetic data is employed to evaluate the robustness of pretrained segmenters, thereby offering insights into their performance when confronted with real edge cases. Through extensive experiments, we demonstrate a high correlation between the performance of models when evaluated on our synthetic OOD data and when evaluated on real OOD inputs, showing the relevance of such virtual testing. Furthermore, we demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters. Code and data are made public.

Paper Structure (23 sections, 17 figures, 4 tables)

Figures (17)

  • Figure 1: Assessing 40 pretrained segmenters under covariate shifts. Segmentation models under scrutiny were trained on Cityscapes train set only (in-domain data). They are evaluated on (i) Cityscapes validation set, (ii) real OOD data, and (iii) proposed synthetic data. We observe a strong correlation between results on (ii) and (iii).
  • Figure 2: Generating data with covariate shifts. Training (left) and Sampling (right) processes for producing the synthetic data with shifts. For training, only in-domain images and masks are used. For inference, we use the in-domain masks to generate OOD images. No real OOD data is required in the framework.
  • Figure 3: Robustness correlation between real and synthetic covariate shifts across $40$ pretrained segmenters. The tested models (see families in the bottom legend) cover different architectures and sizes. (top) Pearson Correlation Coefficients of mIoUs between Cityscapes and real shifts ('PCC_CS'), and between synthetic shifts and real ones ('PCC_Syn'). (bottom) Scatter plots of synthetic vs. real mIoUs along with the linear regression line accompanied by $95\%$ confidence intervals ('CI'). (a-e) Five types of domain shifts from the Cityscapes in-domain distribution, sorted by increasing gap as assessed by decreasing PCC_CS. The robustness results on synthetic data exhibit a strong correlation with those on real data, particularly in the case of the most distant shifts like 'snow' and 'night'. More details are provided in \ref{sec:class_wise}.
  • Figure 4: Day-night shift. Models are ranked from left to right by their robustness on real night data; ACDC-Night mIoUs are shown on top of model names. For each presented architecture, the most robust model on Cityscapes is tested; the Semantic-FPN, DeeplabV3+, and PSPNet models have ResNet-101 as backbone. The Semantic-FPN model (lowest mIoU on ACDC-Night) serves as the reference for computing the relative mIoUs. Blue bars show the relative mIoUs when testing on our synthetic data, and orange bars show them when testing on Cityscapes validation data. Cityscapes scores are not reliable for ranking models in the night domain. Synthetic scores exhibit a stronger correlation with real night scores, as evidenced by the more consistently increasing trend in the blue bars from left to right.
  • Figure 5: Pearson Correlation vs. # Synthetic Samples. Using more synthetic samples contributes to increased stability in the results. Empirical plots demonstrate that approximately $500$ samples are sufficient for a stable correlation assessment.
  • ...and 12 more figures
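
The correlation measure named in the Figure 3 caption can be illustrated with a minimal sketch. It computes a Pearson Correlation Coefficient over per-model mIoUs, once between Cityscapes and real-shift scores (the caption's 'PCC_CS') and once between synthetic-shift and real-shift scores ('PCC_Syn'). The function name `pearson` and all mIoU values below are hypothetical placeholders chosen for illustration; they are not results from the paper, and this is not the authors' evaluation code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL per-model mIoUs (one entry per segmenter), for illustration only.
miou_cityscapes = [76.0, 78.5, 80.1, 81.3, 82.0]  # in-domain validation
miou_synthetic = [30.0, 41.0, 38.0, 52.0, 55.0]   # synthetic shifted data
miou_real = [28.0, 43.0, 36.0, 50.0, 57.0]        # real shifted data

pcc_cs = pearson(miou_cityscapes, miou_real)   # analogue of 'PCC_CS'
pcc_syn = pearson(miou_synthetic, miou_real)   # analogue of 'PCC_Syn'
print(f"PCC_CS  = {pcc_cs:.3f}")
print(f"PCC_Syn = {pcc_syn:.3f}")
```

With real measurements, a PCC_Syn noticeably higher than PCC_CS would indicate that synthetic scores track real-shift robustness better than in-domain scores do, which is the comparison the caption describes.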