On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines

Selim Kuzucu, Kemal Oksuz, Jonathan Sadeghi, Puneet K. Dokania

TL;DR

The paper argues that calibrating object detectors requires joint evaluation of accuracy and confidence, and it shows that prevailing metrics and train-time calibration approaches can yield misleading conclusions. It introduces LaECE_0 and LaACE_0 as fine-grained, localisation-aware calibration errors and pairs them with LRP [LRPPAMI] for a threshold-aware, model-dependent evaluation framework, validated on COCO, Cityscapes, and LVIS with domain-shift splits. The authors also propose lightweight post-hoc calibrators (Platt Scaling and Isotonic Regression) tailored to detection and demonstrate that they substantially outperform train-time calibration methods on multiple detectors and tasks, including instance segmentation and long-tailed datasets. The framework provides practical baselines and guidance for robust calibration in safety-critical vision systems and highlights the importance of proper dataset design and operating conditions for fair comparisons.

Abstract

Reliable usage of object detectors requires them to be calibrated -- a crucial problem that requires careful attention. Recent approaches towards this involve (1) designing new loss functions to obtain calibrated detectors by training them from scratch, and (2) post-hoc Temperature Scaling (TS) that learns to scale the likelihood of a trained detector to output calibrated predictions. These approaches are then evaluated based on a combination of Detection Expected Calibration Error (D-ECE) and Average Precision. In this work, via extensive analysis and insights, we highlight that these recent evaluation frameworks, evaluation metrics, and the use of TS have notable drawbacks leading to incorrect conclusions. As a step towards fixing these issues, we propose a principled evaluation framework to jointly measure calibration and accuracy of object detectors. We also tailor efficient and easy-to-use post-hoc calibration approaches such as Platt Scaling and Isotonic Regression specifically for the object detection task. Contrary to the common notion, our experiments show that once designed and evaluated properly, post-hoc calibrators, which are extremely cheap to build and use, are much more powerful and effective than the recent train-time calibration methods. To illustrate, D-DETR with our post-hoc Isotonic Regression calibrator outperforms the recent train-time state-of-the-art calibration method Cal-DETR by more than 7 D-ECE on the COCO dataset. Additionally, we propose improved versions of the recently proposed Localization-aware ECE and show the efficacy of our method on these metrics as well. Code is available at: https://github.com/fiveai/detection_calibration.
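
To make the idea of a cheap post-hoc calibrator concrete, the following is a minimal sketch of isotonic regression fitted on held-out detections with the pool-adjacent-violators algorithm. It is not the authors' implementation (see their repository for that). The function names `fit_isotonic` and `apply_isotonic`, the toy data, and the choice of a 0/1 target (1 for a true positive, 0 for a false positive) are assumptions made for illustration only.

```python
from bisect import bisect_left


def fit_isotonic(scores, targets):
    """Fit a non-decreasing step function from raw scores to targets (PAVA).

    `targets` is a hypothetical calibration target per detection, e.g. 1.0 for
    a true positive and 0.0 for a false positive on a held-out set.
    """
    # Group detections that share the same score so ties form a single point.
    grouped = {}
    for s, t in zip(scores, targets):
        total, count = grouped.get(s, (0.0, 0))
        grouped[s] = (total + t, count + 1)

    # Each block stores [sum of targets, number of detections, largest score].
    blocks = []
    for s in sorted(grouped):
        total, count = grouped[s]
        blocks.append([total, count, s])
        # Pool adjacent blocks while their means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def apply_isotonic(model, score):
    """Map a raw score to its calibrated value using the fitted step function."""
    uppers, values = model
    # First block whose upper score bound is >= score; clamp above the range.
    idx = bisect_left(uppers, score)
    return values[min(idx, len(values) - 1)]


# Toy held-out detections: raw confidence and whether each was a true positive.
held_out_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
held_out_targets = [0, 1, 0, 1, 1, 1]
model = fit_isotonic(held_out_scores, held_out_targets)
calibrated = apply_isotonic(model, 0.25)
```

Fitting only requires sorting the held-out detections once, and applying the calibrator is a binary search, which is why such calibrators are cheap compared with retraining a detector using a new loss.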

Paper Structure (36 sections, 35 equations, 8 figures, 18 tables, 2 algorithms)

This paper contains 36 sections, 35 equations, 8 figures, 18 tables, 2 algorithms.

Figures (8)

  • Figure 1: The performance of different detectors over operating confidence thresholds on COCO minitest. Orange: Faster R-CNN, Green: RS R-CNN, Purple: ATSS, Red: PAA, Blue: D-DETR. For all measures except AP, lower is better. It is not trivial to identify an operating threshold and compare detectors, especially when the common evaluation [CalibrationOD, munir2022tcd, MCCL, munir2023bpc, munir2023caldetr], combining D-ECE for calibration and AP for accuracy, is used. Instead, we use $\mathrm{LaECE_0}$ and LRP.
  • Figure 2: Comparison of calibration methods in terms of on COCO mini-test using D-DETR [DDETR]. Post-hoc calibrators and are obtained on a subset of Objects365 [Objects365] following -style evaluation.
  • Figure 3: A pictorial comparison of the different calibration errors. (a) Uncalibrated detections of D-DETR on an image from BDD100K. The detections on the left and right have IoUs of $0.74$ and $0.48$ with the objects. (b) Calibrated detections in terms of D-ECE and $\mathrm{LaECE}$ using $\tau=0.50$, and $\mathrm{D-ECE_C}$, COCO-style as in [popordanoska2024CE]. $\mathrm{D-ECE_C}=?$ as this calibration error does not have a global minimum, as shown in (d). (c) Calibrated detections in terms of $\mathrm{LaECE_{0}}$ and $\mathrm{LaACE_{0}}$, in which confidence matches IoU. (d-f) Calibration errors for different types of detections; $\mathrm{LaACE_{0}}$ behaves the same as $\mathrm{LaECE_{0}}$ and is hence excluded for clarity. App. \ref{app:analyses} presents the details.
  • Figure 4: The reliability diagrams of UP-DETR.
  • Figure A.5: $\mathrm{LaACE}_0$ (red dashed line at $27.1$) and $\mathrm{LaECE}_0$ over different numbers of bins (blue curve) using uncalibrated D-DETR on COCO minitest. The number of bins starts from the original $25$ bins for $\mathrm{LaECE}_0$ and is doubled at each step. $\mathrm{LaECE}_0$ converges to $\mathrm{LaACE}_0$ as the number of bins increases.
  • ...and 3 more figures
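
The relationship described in the Figure A.5 caption, where a binned calibration error approaches a per-detection error as the number of bins grows, can be illustrated with a generic sketch. This is not the paper's exact definition of $\mathrm{LaECE}_0$ or $\mathrm{LaACE}_0$, which are not given in this excerpt. The function names, the equal-width binning, the toy data, and the assumption that each detection's target is its IoU with the matched object (0 for a false positive) are all illustrative.

```python
def per_detection_error(confidences, targets):
    """Mean absolute gap between confidence and target, one detection at a time."""
    return sum(abs(c - t) for c, t in zip(confidences, targets)) / len(confidences)


def binned_error(confidences, targets, n_bins):
    """ECE-style error: weighted |mean confidence - mean target| per bin.

    Uses equal-width confidence bins on [0, 1]; a bin's weight is its share
    of all detections.
    """
    conf_sum = [0.0] * n_bins
    target_sum = [0.0] * n_bins
    for c, t in zip(confidences, targets):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        conf_sum[b] += c
        target_sum[b] += t
    total = len(confidences)
    # (count / total) * |mean conf - mean target| simplifies to |sum diff| / total,
    # so empty bins contribute zero.
    return sum(abs(conf_sum[b] - target_sum[b]) for b in range(n_bins)) / total


# Toy detections: confidence and assumed target (IoU, or 0 for a false positive).
confidences = [0.9, 0.8, 0.3, 0.2]
targets = [0.7, 0.0, 0.5, 0.0]

coarse = binned_error(confidences, targets, 1)
fine = binned_error(confidences, targets, 4)
limit = per_detection_error(confidences, targets)
```

Within a bin, over- and under-confident detections can cancel, so the binned value never exceeds the per-detection value. As bins shrink until each holds detections of a single confidence level, that cancellation disappears and the two quantities coincide, which is the behaviour the caption reports.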