Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

Seungju Yoo, Hyuk Kwon, Joong-Won Hwang, Kibok Lee

TL;DR

The paper addresses the challenge of estimating object detector performance without ground-truth labels, especially under deployment-time distribution shifts. It introduces Prediction Consistency and Reliability (PCR), a two-score framework that leverages pre-NMS and post-NMS bounding boxes to infer localization and classification quality, and maps these signals to mAP via regression over a corruption-based meta-dataset. By adopting ImageNet-C style corruptions across varying severities, the authors provide a realistic and scalable benchmark for AutoEval in object detection. Empirical results show that PCR consistently outperforms existing AutoEval baselines for both vehicle and pedestrian detection, with robustness across corruption severities and gains when combined with Box Stability (BoS). This work enables label-free model selection and monitoring in real-world settings and establishes a practical AutoEval protocol for object detection.
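The regression step mentioned above can be pictured with a minimal sketch. It assumes an ordinary least-squares linear fit from two label-free scores to mAP; the source does not state which regressor the authors use, and the function names (`fit_score_to_map`, `predict_map`) and all numbers below are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_score_to_map(scores, maps):
    """Least-squares fit of mAP = scores @ w + b over meta-dataset samples.

    scores: (n_sets, n_scores) label-free scores, one row per corrupted set.
    maps:   (n_sets,) ground-truth mAP measured on the same sets.
    Returns the coefficient vector [w_1, ..., w_k, b].
    """
    scores = np.asarray(scores, dtype=float)
    maps = np.asarray(maps, dtype=float)
    # Append a column of ones so the last coefficient acts as the intercept.
    design = np.hstack([scores, np.ones((scores.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, maps, rcond=None)
    return coef


def predict_map(coef, scores):
    """Estimate mAP for unlabeled sets from their scores."""
    scores = np.atleast_2d(np.asarray(scores, dtype=float))
    design = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return design @ coef


# Made-up meta-dataset: columns are (consistency, reliability) scores.
meta_scores = np.array([[0.1, 0.9], [0.4, 0.6], [0.7, 0.3], [0.2, 0.5]])
# Synthetic, exactly linear mAP values (not results from the paper).
meta_maps = meta_scores @ np.array([-20.0, 30.0]) + 10.0

coef = fit_score_to_map(meta_scores, meta_maps)
estimate = predict_map(coef, [0.3, 0.7])[0]
print(coef, estimate)
```

In practice, each row of the meta-dataset would come from one corrupted copy of a labeled set, and the fitted mapping would then be applied to scores computed on an unlabeled deployment set.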

Abstract

Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at https://github.com/YonseiML/autoeval-det.
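As a rough illustration of the second signal, the sketch below computes, for one post-NMS box, the fraction of overlapping pre-NMS boxes whose confidence is high. This is a sketch under stated assumptions, not the authors' implementation: the IoU threshold that defines "overlapping" (`iou_thr`), the confidence threshold (`conf_thr`), and the function names are all hypothetical.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    box = np.asarray(box, dtype=float)
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area_a + area_b - inter
    # Avoid division by zero for degenerate boxes.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def reliability(post_box, pre_boxes, pre_scores, iou_thr=0.5, conf_thr=0.5):
    """Fraction of overlapping pre-NMS boxes with confidence >= conf_thr.

    Both thresholds are assumptions made for this sketch.
    """
    pre_scores = np.asarray(pre_scores, dtype=float)
    overlapping = iou_one_to_many(post_box, pre_boxes) >= iou_thr
    if not overlapping.any():
        return 0.0
    return float(np.mean(pre_scores[overlapping] >= conf_thr))


# Toy example: three candidates overlap the kept box, two of them confidently.
post_box = [0, 0, 10, 10]
pre_boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [0, 0, 9, 9], [50, 50, 60, 60]]
pre_scores = [0.9, 0.8, 0.3, 0.95]
score = reliability(post_box, pre_boxes, pre_scores)
print(score)
```

The far-away fourth candidate is ignored despite its high confidence, since only boxes that overlap the retained box contribute to the proportion.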

Paper Structure

This paper contains 22 sections, 9 equations, 13 figures, 25 tables.

Figures (13)

  • Figure 1: Visual example of PCR. Green boxes represent the ground-truth bounding boxes, and red and blue boxes denote an incorrect detection with low confidence and a correct detection with high confidence, respectively. Orange boxes show the pre-NMS candidate boxes, where the overlaid numbers indicate their confidence scores. (a) Consistency. The red box overlaps many pre-NMS boxes, yielding high consistency. Our consistency score measures spatial consistency with a merged pre-NMS box, motivated by the observation that boxes with low confidence and high consistency correlate with lower mAP. (b) Reliability. The blue box overlaps many pre-NMS boxes with high confidence scores, yielding high reliability. Our reliability score measures the proportion of overlapping pre-NMS boxes with high confidence scores, motivated by the observation that boxes with high confidence and high reliability correlate with higher mAP.
  • Figure 2: The average IoU between ground-truth boxes and final predictions, grouped by confidence level using a threshold of 0.5 across datasets. Predictions with low confidence generally exhibit lower IoU than those with high confidence, indicating a correlation between confidence and localization quality.
  • Figure 3: (a) A merged box tightly encloses all pre-NMS boxes associated with a post-NMS box. (b) Consistency is computed based on IoU and a closeness term measured by the normalized distance between the center points of a post-NMS box and its corresponding merged box.
  • Figure 4: (a) The consistency score $S^{\mathrm{C}}$ shows a strong negative correlation with mAP. (b) Predictions with low confidence and high consistency suggest that the detector consistently localizes the same region without any object, indicating a detection failure. In contrast, predictions with low confidence and low consistency provide insufficient information to make a decision. (c) The reliability score $S^{\mathrm{R}}$ shows a strong positive correlation with mAP. (d) Predictions with high confidence and high reliability suggest that the detector repeatedly localizes and classifies the same object, indicating a successful detection. In contrast, predictions with high confidence and low reliability provide insufficient information to make a decision.
  • Figure 4: Ablation on components of consistency.
  • ...and 8 more figures
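The merged-box construction and the two consistency terms described in the Figure 3 caption can be sketched as follows. The caption does not say how the IoU and closeness terms are combined or how the center distance is normalized, so the product form and the normalization by the merged box's diagonal are assumptions of this sketch; the function names are hypothetical, and the pre-NMS boxes associated with a post-NMS box are taken as given.

```python
import numpy as np


def merged_box(pre_boxes):
    """Tightest box enclosing all given (x1, y1, x2, y2) boxes."""
    b = np.asarray(pre_boxes, dtype=float).reshape(-1, 4)
    return np.array([b[:, 0].min(), b[:, 1].min(), b[:, 2].max(), b[:, 3].max()])


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def consistency(post_box, pre_boxes):
    """IoU times closeness between a post-NMS box and its merged box.

    Assumptions of this sketch: closeness = 1 - center distance divided by
    the merged box's diagonal, and the two terms are multiplied.
    """
    post = np.asarray(post_box, dtype=float)
    merged = merged_box(pre_boxes)
    center_post = np.array([(post[0] + post[2]) / 2, (post[1] + post[3]) / 2])
    center_merged = np.array([(merged[0] + merged[2]) / 2, (merged[1] + merged[3]) / 2])
    diagonal = np.hypot(merged[2] - merged[0], merged[3] - merged[1])
    distance = np.linalg.norm(center_post - center_merged)
    closeness = 1.0 - distance / diagonal if diagonal > 0 else 0.0
    return iou(post, merged) * float(np.clip(closeness, 0.0, 1.0))


# Toy example: two candidates merge into [0, 0, 10, 10]; the kept box is centered in it.
pre = [[0, 0, 8, 8], [2, 2, 10, 10]]
post = [2, 2, 8, 8]
print(merged_box(pre), consistency(post, pre))
```

Here the centers coincide, so the closeness term is 1 and the score reduces to the IoU between the kept box and the merged box (36 / 100).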