GroupEnsemble: Efficient Uncertainty Estimation for DETR-based Object Detection

Yutong Yang, Katarina Popović, Julian Wiederer, Markus Braun, Vasileios Belagiannis, Bin Yang

TL;DR

GroupEnsemble is an efficient and effective uncertainty estimation method for DETR-like models, validated on autonomous driving scenes (Cityscapes) and common daily scenes (COCO).

Abstract

Detection Transformer (DETR) and its variants show strong performance on object detection, a key task for autonomous systems. However, a critical limitation of these models is that their confidence scores reflect only semantic uncertainty and fail to capture the equally important spatial uncertainty. This results in an incomplete assessment of detection reliability. Deep Ensembles can address this by providing high-quality spatial uncertainty estimates, but their immense memory consumption makes them impractical for real-world applications. A cheaper alternative, Monte Carlo (MC) Dropout, suffers from high latency because it requires multiple forward passes during inference to estimate uncertainty. To address these limitations, we introduce GroupEnsemble, an efficient and effective uncertainty estimation method for DETR-like models. GroupEnsemble simultaneously predicts multiple individual detection sets by feeding additional diverse groups of object queries to the transformer decoder during inference. Each query group is transformed by the shared decoder in isolation and predicts a complete detection set for the same input. An attention mask is applied to the decoder to prevent inter-group query interactions, ensuring that each group detects independently, which is what makes the ensemble-based uncertainty estimates reliable. By leveraging the decoder's inherent parallelism, GroupEnsemble estimates uncertainty in a single forward pass without sequential repetition. We validated our method on autonomous driving scenes and common daily scenes using the Cityscapes and COCO datasets, respectively. The results show that a hybrid approach combining MC-Dropout and GroupEnsemble outperforms Deep Ensembles on several metrics at a fraction of the cost. The code is available at https://github.com/yutongy98/GroupEnsemble.
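The group-isolating attention mask described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_group_attention_mask` and its parameters are hypothetical, and the sketch assumes that queries are laid out group by group and that `True` marks a blocked query-key pair (the convention used for boolean masks in, for example, PyTorch's `nn.MultiheadAttention`).

```python
from typing import List


def build_group_attention_mask(num_groups: int, queries_per_group: int) -> List[List[bool]]:
    """Build a square self-attention mask for grouped object queries.

    Assumption: queries are stored contiguously, group after group.
    Entry [q][k] is True when query q must NOT attend to key k,
    i.e. when q and k belong to different groups.
    """
    total = num_groups * queries_per_group
    mask = []
    for q in range(total):
        q_group = q // queries_per_group
        # Block every key whose group index differs from the query's group.
        row = [(k // queries_per_group) != q_group for k in range(total)]
        mask.append(row)
    return mask


# Toy example: 3 groups of 2 queries each ("#" = blocked, "." = allowed).
mask = build_group_attention_mask(num_groups=3, queries_per_group=2)
for row in mask:
    print("".join("#" if blocked else "." for blocked in row))
```

For this toy setting the printed pattern is block-diagonal (`..####`, `..####`, `##..##`, `##..##`, `####..`, `####..`): each query only sees queries of its own group, so every group produces its detection set as if it were run through the shared decoder alone.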

Paper Structure (31 sections, 8 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Visualizations of semantic and spatial uncertainties estimated by GroupEnsemble. Semantic uncertainty is represented by the classification confidence (Class. Conf.), while spatial uncertainty is visualized using 95% confidence intervals (dashed lines) around the mean (solid line). The color coding indicates the level of uncertainty: green for certain detections, yellow for medium uncertainty, and red for high uncertainty. Factors such as occlusion or background clutter can lead to uncertainty in the location or size of the detected objects. By estimating both uncertainties, GroupEnsemble provides a comprehensive assessment of the reliability of the detections.
  • Figure 2: Overview of GroupEnsemble. For an input image, a standard DETR (a) outputs deterministic detections. In contrast, our GroupEnsemble (b) feeds $G-1$ additional groups of object queries to the decoder and employs a self-attention mask (gray grid cells indicate blocked attention) to block all inter-group query interactions. This enables the decoder to independently and simultaneously transform each query group, predicting multiple individual detection sets in a single pass. These detections are then clustered and aggregated to produce final detections with semantic and spatial uncertainty estimates. Spatial uncertainty is depicted as dashed confidence intervals around the mean box. Larger intervals indicate higher uncertainty, e.g., for the person occluded by the truck door.
  • Figure 3: The detection performance per query group is similar. The mAP scores are obtained by training a Conditional DETR with Group DETR on the Cityscapes dataset, using five query groups.
  • Figure 4: Visualizations of detection clusters and the corresponding object queries (represented as center reference points) from different query groups on the Cityscapes dataset. Different colors are used to distinguish between groups. The diversity of reference points detecting the same ground-truth objects is evident. The reference points and bounding boxes may overlap.
  • Figure 5: D-ECE plots of the different confidence aggregation methods. The calibration plots reveal that using the mean confidence score of clusters as the final scores results in under-confident detections (as highlighted in the 2nd plot), whereas using the maximum confidence without scaling leads to slightly over-confident detections (highlighted in the 3rd plot). Both cases negatively impact calibration performance. On the other hand, using the scaled maximum confidence, as shown in the 4th plot, effectively addresses these issues and improves calibration.
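The aggregation step mentioned in the captions of Figures 1, 2, and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the cluster is assumed to be given already (the clustering criterion is not described here), boxes are assumed to be `(x1, y1, x2, y2)` tuples, the 95% interval assumes Gaussian-distributed coordinates (mean ± 1.96 standard deviations), and the name `aggregate_cluster` is hypothetical. The scaled maximum confidence from Figure 5 is not reproduced because its scaling rule is not specified in this summary.

```python
from statistics import fmean, stdev

Z_95 = 1.96  # two-sided 95% quantile of a standard normal (Gaussian assumption)


def aggregate_cluster(boxes, scores):
    """Aggregate one cluster of detections coming from different query groups.

    boxes:  list of (x1, y1, x2, y2) tuples, one per group member (assumed format)
    scores: list of classification confidences, aligned with boxes
    """
    coords = list(zip(*boxes))  # four tuples: all x1, all y1, all x2, all y2
    mean_box = [fmean(c) for c in coords]
    # Sample standard deviation per coordinate; a single box has no spread.
    std_box = [stdev(c) if len(c) > 1 else 0.0 for c in coords]
    interval_95 = [(m - Z_95 * s, m + Z_95 * s) for m, s in zip(mean_box, std_box)]
    return {
        "mean_box": mean_box,        # solid box in Figure 1
        "std_box": std_box,          # spatial uncertainty per coordinate
        "interval_95": interval_95,  # dashed intervals in Figure 1
        "mean_conf": fmean(scores),  # mean-confidence aggregation
        "max_conf": max(scores),     # unscaled maximum-confidence aggregation
    }


# Toy cluster: three groups detect the same object with slightly different boxes.
cluster_boxes = [
    (100.0, 50.0, 180.0, 210.0),
    (102.0, 48.0, 184.0, 214.0),
    (98.0, 52.0, 176.0, 206.0),
]
cluster_scores = [0.9, 0.8, 0.7]
result = aggregate_cluster(cluster_boxes, cluster_scores)
print(result["mean_box"], result["std_box"])
```

In this toy cluster the right and bottom edges vary twice as much as the left and top edges, so their intervals are twice as wide. The mean confidence (about 0.8) is lower than the maximum (0.9), which matches the direction of the under-confident versus over-confident behavior that Figure 5 attributes to the two unscaled aggregation choices.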