SemSegBench & DetecBench: Benchmarking Reliability and Generalization Beyond Classification

Shashank Agnihotri, David Schader, Jonas Jakubassa, Nico Sharei, Simon Kral, Mehmet Ege Kaçar, Ruben Weber, Margret Keuper

TL;DR

The paper tackles reliability and generalization for semantic segmentation and object detection under distribution shifts and adversarial perturbations, extending robustness benchmarks beyond classification. It introduces SemSegBench and DetecBench, unified benchmarking tools built on mmsegmentation and mmdetection, and reports the largest-scale analysis to date across 76 segmentation models over 4 datasets and 61 detectors over 2 datasets, under adversarial attacks and common corruptions. Two scalable metrics, the Reliability Measure ($\mathrm{ReM}$) and Generalization Ability Measure ($\mathrm{GAM}$), quantify worst-case performance and generalization, revealing that architectural design and backbone type strongly influence robustness, while gains in in-domain accuracy do not guarantee improved reliability or OOD generalization. The work provides open-source benchmarks to enable rapid, standardized analysis and guides the design of more robust semantic segmentation and object detection models for real-world safety-critical applications.
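As a rough illustration of the worst-case aggregation the summary describes, the sketch below assumes that a ReM-style score is the minimum task metric over a set of adversarial attacks and that a GAM-style score is the minimum over a set of common corruptions at one severity. This reading is an assumption based on the phrase "worst-case performance", not the paper's exact definition. The function name, attack and corruption names, and all numbers are hypothetical placeholders.

```python
from statistics import mean


def worst_case(scores):
    """Return the lowest score among per-perturbation results."""
    if not scores:
        raise ValueError("need at least one evaluation")
    return min(scores.values())


# Hypothetical placeholder mIoU values (%), NOT results from the paper.
attack_miou = {"attack_a": 12.5, "attack_b": 9.0}
corruption_miou = {"fog": 30.0, "snow": 22.0, "brightness": 41.0}

# Assumed ReM-style score: worst case over the adversarial attacks.
rem = worst_case(attack_miou)

# Assumed GAM-style score: worst case over the corruptions at one severity.
gam = worst_case(corruption_miou)

# The mean over corruptions, for comparison with the worst case.
mean_corruption = mean(corruption_miou.values())

print(f"ReM-style: {rem}, GAM-style: {gam}, mean: {mean_corruption}")
```

Taking the minimum rather than the mean keeps a single weak spot from being averaged away, which is what a worst-case measure is meant to expose.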

Abstract

Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real-world applications in safety-critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state-of-the-art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open-sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of 6139 evaluations. We anticipate that the collected data will foster and encourage future research towards improved model reliability beyond classification.

Paper Structure

This paper contains 86 sections, 17 equations, 31 figures, 4 tables.

Figures (31)

  • Figure 1: An overview of semantic segmentation (top) and object detection (bottom) methods proposed over time and their reliability and generalization ability on ADE20K [ade20k] and MS-COCO [ms-coco], respectively. The y-axes show the mean Intersection over Union (mIoU) in the top panel and the mean Average Precision (mAP) in the bottom panel; higher is better. Performance on i.i.d. samples has increased over time owing to architectural and other design choices; however, reliability and generalization ability have not improved at the same rate and lag behind.
  • Figure 2: Semantic segmentation on the ADE20K dataset. Colors represent the architecture of each method, while marker shapes represent its backbone. All methods were trained on the train set of the ADE20K dataset. Please refer to the Appendix for results on other datasets, i.e., Cityscapes and PASCAL VOC2012; the Appendix also shows a high correlation between performance across different datasets. Subfigures are numbered left to right.
  • Figure 3: Object detection on the MS-COCO dataset. Colors represent the backbone of each method, while marker shapes represent its architecture. All methods were trained on the train set of the MS-COCO dataset. Subfigures are numbered left to right.
  • Figure 4: To empirically determine whether synthetic common corruptions, such as those proposed by [commoncorruptions], truly represent real-world distribution and domain shifts, we look for correlations between evaluations on ACDC and on 2D Common Corruptions. Each model is trained on the training set of the Cityscapes dataset. The y-axis shows values from evaluations on the ACDC dataset, and the x-axis shows values from evaluations on the Common Corruptions at severity=3. From left to right, we correlate ACDC performance with: first, the mean performance across all common corruptions; second, the synthetic brightness corruption; third, the synthetic snow corruption; and fourth, the synthetic fog corruption. We observe a positive correlation, and in particular a strong positive correlation between performance on ACDC and the mean performance across all synthetic common corruptions.
  • Figure 5: To better understand the correlations from Figure 4, we look at the correlations between ACDC mIoU, $\mathrm{GAM}_3$, and the mean mIoU across all 2D Common Corruptions.
  • ...and 26 more figures