
Quantifying the Reliability of Predictions in Detection Transformers: Object-Level Calibration and Image-Level Uncertainty

Abstract

DETR and its variants have emerged as promising architectures for object detection, offering an end-to-end prediction pipeline. In practice, however, DETRs generate hundreds of predictions that far outnumber the actual objects present in an image. This raises a critical question: which of these predictions can be trusted? Addressing this concern, we provide empirical and theoretical evidence that predictions within the same image play distinct roles, resulting in varying levels of reliability. Our analysis reveals that DETRs employ an optimal specialist strategy: one prediction per object is trained to be well-calibrated, while the remaining predictions are trained to suppress their foreground confidence to near zero, even when they maintain accurate localization. We show that this strategy emerges as the loss-minimizing solution under the Hungarian matching algorithm and fundamentally shapes DETRs' outputs. While selecting only the well-calibrated predictions would be ideal, they cannot be identified at inference time, so any post-processing algorithm risks outputting a set of predictions with mixed calibration levels. Practical deployment therefore requires jointly evaluating both the model's calibration quality and the effectiveness of the post-processing algorithm. However, we demonstrate that existing metrics such as average precision and expected calibration error are inadequate for this task. To address this issue, we introduce the Object-level Calibration Error (OCE), an object-centric metric that penalizes both retaining suppressed predictions and missing ground-truth foreground objects, making it suitable both for evaluating models and for identifying reliable prediction subsets. Finally, we present a post hoc uncertainty quantification framework that predicts per-image model accuracy.
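
As context for the matching mechanism the abstract refers to, the sketch below illustrates (it is not the authors' implementation) how Hungarian matching assigns exactly one prediction per ground-truth object, leaving all other queries unmatched and, during training, supervised toward "no object" (the suppressed predictions above). It uses `scipy.optimize.linear_sum_assignment`, which DETR-style detectors commonly rely on; the cost terms are simplified (an L1 box cost only, whereas DETR also adds a generalized-IoU term), and the function name and weights are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=1.0):
    """One-to-one assignment of N query predictions to M ground-truth objects.

    pred_probs: (N, C) per-query class probabilities
    pred_boxes: (N, 4) predicted boxes
    gt_labels:  (M,)   ground-truth class indices
    gt_boxes:   (M, 4) ground-truth boxes
    Returns (pred_idx, gt_idx): the matched pairs. Queries not in pred_idx are
    unmatched; the training loss pushes their foreground confidence toward zero.
    """
    # Classification cost: negative probability of the ground-truth class.
    cls_cost = -pred_probs[:, gt_labels]                                        # (N, M)
    # Localization cost: simplified here to an L1 box distance.
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)    # (N, M)
    cost = cls_cost + box_weight * box_cost
    # Hungarian algorithm: the globally loss-minimizing one-to-one assignment.
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx


# Toy example: 5 queries, 2 objects -> 2 matched queries, 3 suppressed ones.
rng = np.random.default_rng(0)
pred_probs = rng.dirichlet(np.ones(91), size=5)   # 91 classes, COCO-style
pred_boxes = rng.random((5, 4))
gt_labels = np.array([3, 17])
gt_boxes = rng.random((2, 4))
matched_pred, matched_gt = hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes)
```

Because the assignment minimizes the summed matching cost, only the matched queries receive foreground supervision; this is the mechanism behind the specialist behavior described in the abstract.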