Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments

Yang Yang, Wenhai Wang, Zhe Chen, Jifeng Dai, Liang Zheng

TL;DR

Under feature map dropout, good detectors tend to output bounding boxes whose locations change little, while the boxes of poor detectors shift noticeably. The resulting box stability score (BoS score) has a strong, positive correlation with detection accuracy, measured by mean average precision, under various test environments.

Abstract

Bounding boxes uniquely characterize object detection, where a good detector gives accurate bounding boxes of categories of interest. However, in the real world, where test ground truths are not provided, it is non-trivial to find out whether bounding boxes are accurate, which prevents us from assessing a detector's generalization ability. In this work, we find that under feature map dropout, good detectors tend to output bounding boxes whose locations do not change much, while bounding boxes of poor detectors undergo noticeable position changes. We compute the box stability score (BoS score) to reflect this stability. Specifically, given an image, we compute a normal set of bounding boxes and a second set after feature map dropout. To obtain the BoS score, we use bipartite matching to find the corresponding boxes between the two sets and compute the average Intersection over Union (IoU) across the entire test set. We find that the BoS score has a strong, positive correlation with detection accuracy measured by mean average precision (mAP) under various test environments. This relationship allows us to predict the accuracy of detectors on various real-world test sets without accessing test ground truths, which we verify on canonical detection tasks such as vehicle detection and pedestrian detection. Code and data are available at https://github.com/YangYangGirl/BoS.
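
The abstract's procedure (two box sets per image, bipartite matching, average IoU) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `iou`, `matched_ious`, and `bos_score` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, the matching objective (maximum total IoU) and the per-image averaging are assumptions, and the matching is brute-forced over permutations for clarity, so it only suits small box sets. A Hungarian-algorithm solver would replace it in practice.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_ious(boxes, boxes_dropout):
    """One-to-one matching that maximizes total IoU (assumed objective).

    Brute force over permutations: exponential cost, small sets only.
    Unmatched boxes in the larger set are ignored (an assumption).
    """
    small, large = sorted((boxes, boxes_dropout), key=len)
    if not small:
        return []
    best, best_total = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        ious = [iou(s, large[j]) for s, j in zip(small, perm)]
        total = sum(ious)
        if total > best_total:
            best, best_total = ious, total
    return best


def bos_score(per_image_pairs):
    """Mean over images of the mean matched IoU (assumed averaging scheme).

    per_image_pairs: iterable of (boxes, boxes_after_dropout) per image.
    Images with no matched boxes are skipped.
    """
    scores = []
    for boxes, boxes_dropout in per_image_pairs:
        ious = matched_ious(boxes, boxes_dropout)
        if ious:
            scores.append(sum(ious) / len(ious))
    return sum(scores) / len(scores) if scores else 0.0


# Toy inputs, not taken from the paper.
stable = ([(0, 0, 10, 10), (20, 20, 30, 30)],
          [(20, 20, 30, 30), (0, 0, 10, 10)])   # same boxes, reordered
shifted = ([(0, 0, 10, 10)], [(5, 0, 15, 10)])  # box moved by half its width

print(bos_score([stable]))             # 1.0
print(round(bos_score([shifted]), 3))  # 0.333
```

In this toy case, boxes that stay in place under dropout give a score of 1.0, and a box that shifts by half its width gives 1/3, which matches the intuition that stable boxes yield a higher score.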

Paper Structure (25 sections, 5 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Visual examples illustrating the correlation between bounding box stability and detection accuracy. (a) Bounding boxes change significantly upon MC dropout, where we observe complex scenarios and incorrect detection results. (b) Bounding boxes remain at similar locations after dropout, and relatively correct detection results are observed. (c) We observe that the box stability score is positively correlated with mAP (coefficient of determination $R^2 > 0.94$, Spearman’s rank correlation $\rho > 0.93$).
  • Figure 2: Correlation between different measurements and mAP. Each "$\triangle$" marks one of nine real-world datasets, distinguished by color, such as BDD [yu2020bdd100k] and Cityscapes [cordts2016cityscapes], which are used as seeds to generate sample sets by various transformations. Each "$\bullet$" represents a sample set, colored by the seed set it was generated from. mAP is obtained by RetinaNet [lin2017retina] trained on the COCO training set. Panels (a), (b), and (c) use three measurements: two existing methods [hendrycks2016baseline, garg2022leveraging] and our box stability score. The trend lines are computed by linear regression, from which we can observe a strong linear relationship (coefficient of determination $R^2 > 0.94$, Spearman’s rank correlation $\rho > 0.93$) between box stability score and mAP.
  • Figure 3: Comparing different (a) dropout positions and (b) dropout rates. We plot the coefficient of determination $R^2$ obtained during training under different dropout configurations and matching times, measured on three different training meta-sets. We greedily search for the optimal values during training so that $R^2$ is maximized. "[0]" means adding a dropout layer after stage 0 of the backbone, "[0, 1]" means adding a dropout layer after each of stage 0 and stage 1 of the backbone, and so on. (c) AutoEval performance (RMSE, %) of confidence stability score (CS score), box stability score (BoS score), and fused dataset-level statistics. No obvious improvement is observed after fusion. "n.s." indicates that the difference between results is not statistically significant (i.e., $p\text{-value} > 0.05$). $\star$ indicates a statistically significant difference (i.e., $0.01 < p\text{-value} < 0.05$). $\star\star$ indicates a statistically very significant difference (i.e., $0.001 < p\text{-value} < 0.01$).
  • Figure 4: Impact of (a) meta-set size, (b) sample set size, and (c) test set size on the performance of mAP estimators. RMSE (%) is reported. Generally speaking, a larger meta-set and larger sample sets are beneficial for regression model training. Our system gives relatively low RMSE when the number of test images is more than 50.
  • Figure 5: Visual examples illustrating the correlation between bounding box stability and detection accuracy. (a) Bounding boxes change significantly upon MC dropout, where we observe complex scenarios and incorrect detection results. (b) Bounding boxes remain at similar locations after dropout, and relatively correct detection results are observed.
  • ...and 5 more figures
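
The captions above refer to trend lines fit by linear regression, the coefficient of determination $R^2$, and RMSE of the estimated mAP. As a rough sketch of how these quantities relate, the snippet below fits a line from BoS score to mAP on a meta-set and scores a prediction. The helper names (`fit_line`, `r_squared`, `rmse`) are hypothetical, and all numbers are synthetic values chosen for illustration; they are not results from the paper.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    n = len(actual)
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


# Synthetic meta-set: one (BoS score, mAP %) pair per sample set.
bos_scores = [0.2, 0.4, 0.6, 0.8]
maps = [10.0, 20.0, 30.0, 40.0]

slope, intercept = fit_line(bos_scores, maps)
fit_quality = r_squared(bos_scores, maps, slope, intercept)

# Estimate mAP for an unlabeled test set from its BoS score alone,
# then compare against a made-up ground-truth mAP to get an RMSE.
estimated_map = slope * 0.5 + intercept
error = rmse([estimated_map], [27.0])
```

On this perfectly linear toy data the fit has $R^2$ of 1, the estimated mAP at a BoS score of 0.5 is about 25, and the RMSE against the made-up ground truth of 27 is about 2 percentage points.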