A Realistic Protocol for Evaluation of Weakly Supervised Object Localization

Shakeeb Murtaza, Soufiane Belharbi, Marco Pedersoli, Eric Granger

TL;DR

The paper addresses the misalignment between WSOL evaluation and real-world constraints by showing that reliance on manual bounding boxes for model selection and threshold estimation inflates reported localization performance. It introduces a realistic evaluation protocol that substitutes manual LOC supervision with pseudo-bboxes generated from region-proposal models (SS, CLIP, RPN) and uses these for validation-based model selection and LOC-map thresholding. Across CUB and ILSVRC, the authors demonstrate that models selected and thresholds estimated with pseudo-bboxes achieve LOC performance comparable to GT-based validation, while outperforming results obtained using only class-label supervision. This protocol reduces annotation costs, mitigates test-set leakage, and provides a more reproducible, practical framework for WSOL research and deployment.

Abstract

Weakly Supervised Object Localization (WSOL) allows training deep learning models for classification and localization (LOC) using only global class-level labels. The absence of bounding box (bbox) supervision during training raises challenges for hyper-parameter tuning, model selection, and evaluation. In practice, WSOL methods rely on a validation set with bbox annotations for model selection, and on a test set with bbox annotations for estimating the threshold used to produce bboxes from localization maps. This approach, however, is not aligned with the WSOL setting, as these annotations are typically unavailable in real-world scenarios. Our initial empirical analysis shows a significant decline in LOC performance when model selection relies solely on class labels and threshold estimation relies solely on the image itself, compared to using manual bbox annotations. This highlights the importance of incorporating bbox labels for optimal model performance. In this paper, a new WSOL evaluation protocol is proposed that provides LOC information without the need for manual bbox annotations. In particular, we generate noisy pseudo-bboxes from pretrained off-the-shelf region proposal methods such as Selective Search, CLIP, and RPN for model selection. These bboxes are also employed to estimate the threshold from LOC maps, circumventing the need for test-set bbox annotations. Our experiments with several WSOL methods on the ILSVRC and CUB datasets show that using the proposed pseudo-bboxes for validation facilitates model selection and threshold estimation, with LOC performance comparable to that of models selected using GT bboxes on the validation set with thresholds estimated on the test set. It also outperforms models selected using class-level labels and then dynamically thresholded based solely on LOC maps.
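
To make the two validation steps in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how a single threshold and a checkpoint could be chosen from LOC maps using reference boxes (pseudo-bboxes or GT). It assumes LOC maps normalized to [0, 1], boxes given as inclusive pixel coordinates (x0, y0, x1, y1), an IoU criterion of 0.5, and a simplified box extraction that takes the tight box around all above-threshold pixels. All function names are hypothetical.

```python
import numpy as np


def iou(a, b):
    """IoU of two boxes (x0, y0, x1, y1) with inclusive pixel coordinates."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0 + 1) * max(0, iy1 - iy0 + 1)
    area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
    area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
    return inter / (area_a + area_b - inter)


def box_from_map(loc_map, tau):
    """Tight box around all pixels >= tau (assumption: no connected-component step)."""
    ys, xs = np.nonzero(loc_map >= tau)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))


def box_accuracy(loc_maps, ref_boxes, tau, delta=0.5):
    """Fraction of images whose predicted box reaches IoU >= delta with its reference box."""
    hits = 0
    for loc_map, ref in zip(loc_maps, ref_boxes):
        pred = box_from_map(loc_map, tau)
        if pred is not None and iou(pred, ref) >= delta:
            hits += 1
    return hits / len(loc_maps)


def estimate_threshold(loc_maps, ref_boxes, taus=None):
    """Sweep thresholds and return (best_tau, best_accuracy)."""
    if taus is None:
        taus = np.linspace(0.0, 1.0, 101)
    accs = [box_accuracy(loc_maps, ref_boxes, t) for t in taus]
    best = int(np.argmax(accs))
    return float(taus[best]), accs[best]


def select_epoch(maps_per_epoch, ref_boxes):
    """Pick the checkpoint whose best-threshold box accuracy is highest."""
    scores = [estimate_threshold(maps, ref_boxes)[1] for maps in maps_per_epoch]
    return int(np.argmax(scores))
```

Here `ref_boxes` would hold the pseudo-bboxes of the validation images. Passing GT boxes instead gives the oracle setting that the protocol is compared against.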

Paper Structure (16 sections, 2 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: (A) Example of LOC (MaxBoxAcc) and CL accuracy w.r.t. the number of epochs on the CUB validation set in a WSOL setting, where only image-class labels are used to train the model. These curves show that the LOC and CL tasks are only loosely correlated, with convergence reached at different training epochs. Typically, high LOC performance is achieved early in training and then degrades, whereas the classifier takes longer to converge. Selecting the best model for LOC may therefore require LOC annotations in the validation set. (B) Significant bias arises when using GT bounding boxes and test-set thresholds, leading to overestimated LOC accuracy, as bbox annotations are typically absent in real-world scenarios. In a realistic WSOL scenario (CL-BV-OT), where model selection is based on CL accuracy and OT over the validation set, performance declines considerably.
  • Figure 4: Model selection with early stopping using LOC accuracy; the selected epoch is indicated with a dot. Different pseudo-bbox annotation approaches are compared against GT annotations (oracle). Panel (B) is a zoom of panel (A) between epochs 0 and 5. Results are reported on the CUB validation set using the CAM method (Zhou et al., 2016) with IoU as the LOC measure. LOC curves obtained with pseudo-bbox annotations typically behave similarly to those obtained with the oracle GT bbox annotations, making them suitable for WSOL model selection: they rise and peak after a similar number of epochs, then decline and stagnate. In contrast, the CL curve reaches its peak toward the end of training, when LOC performance has already degraded. CL accuracy is therefore inadequate as a WSOL selection criterion for achieving high LOC accuracy. The misalignment between LOC and CL behaviour has been studied further in (Choe et al., 2020).
  • Figure 5: Our proposed LOC pseudo-bbox annotator $\hat{o}$. A set of bbox proposals is first extracted using a region proposal model, and the discriminative ones among them are selected using pointing game analysis (Zhang et al., 2018). When multiple bboxes are selected, the classifier confidence over the foreground region is used to pick the most discriminative bbox.
  • Figure 6: (A) Heatmap illustrating the inaccuracies in noisy bboxes, generated by augmenting GT bboxes, at various noise levels. (B) The blue line represents the average LOC accuracy on the test set over various configurations, with maxima, minima, and standard deviation, when the model is selected using a validation set of noisy bboxes at varying noise levels. The orange lines represent the average maximum and minimum performance of a model chosen during hyperparameter optimization over GT, for configurations analogous to those used with noisy labels. (C) Histogram of the difference in model selection epoch between bboxes at various noise levels and ground-truth boxes, for experiments with identical configurations, showing a concentration around zero. (D) Model selection epoch frequency using GT and noisy boxes across different configurations and noise levels. (E) Illustration of pseudo-bboxes along with the GT on the (E.1) CUB and (E.2) ILSVRC validation sets.
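
The annotator described in the Figure 5 caption can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the pointing-game test keeps a proposal when the peak of a class activation map falls inside it, and that "classifier confidence over the foreground region" is computed by scoring the image with everything outside the box zeroed out. `class_confidence` is a hypothetical callable standing in for a pretrained classifier, and proposals are boxes (x0, y0, x1, y1) in inclusive pixel coordinates.

```python
import numpy as np


def pointing_game_filter(proposals, cam):
    """Keep proposals that contain the location of the CAM's maximum (assumed criterion)."""
    py, px = np.unravel_index(int(np.argmax(cam)), cam.shape)
    return [b for b in proposals if b[0] <= px <= b[2] and b[1] <= py <= b[3]]


def select_pseudo_bbox(image, proposals, cam, class_confidence):
    """Return one pseudo-bbox, or None if no proposal passes the pointing game.

    class_confidence: hypothetical callable mapping a masked image to the
    classifier's confidence for the image-level class label.
    """
    kept = pointing_game_filter(proposals, cam)
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]

    def score(box):
        x0, y0, x1, y1 = box
        # Keep only the pixels inside the box; everything else is zeroed.
        masked = np.zeros_like(image)
        masked[y0:y1 + 1, x0:x1 + 1] = image[y0:y1 + 1, x0:x1 + 1]
        return class_confidence(masked)

    # Several proposals survive: pick the one the classifier is most confident about.
    return max(kept, key=score)
```

The proposals themselves would come from a region proposal model such as Selective Search, CLIP, or RPN; how they are produced is outside this sketch.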