Benchmarking Object Detectors with COCO: A New Path Forward

Shweta Singh; Aayan Yadav; Jitesh Jain; Humphrey Shi; Justin Johnson; Karan Desai

TL;DR

This work identifies substantive flaws in COCO-2017 ground-truth masks that can mislead benchmarking of contemporary object detectors. It introduces COCO-ReM, a cleaner set of annotations produced by a semi-automatic, three-stage pipeline that refines mask boundaries with SAM, adds exhaustive LVIS-based annotations, and manually corrects labeling errors, resulting in a validation set with 40,689 masks and a training set with about 1.09M masks. Experiments across fifty detectors show that all models score higher on COCO-ReM than on COCO-2017 and that model rankings change, with query-based approaches such as Mask2Former and OneFormer benefiting most. Models trained on COCO-ReM also converge faster, underscoring the impact of data quality on detector performance. The authors release the dataset and tooling to support adoption of COCO-ReM in future research and more reliable benchmarking of object detection and segmentation.

Abstract

The Common Objects in Context (COCO) dataset has been instrumental in benchmarking object detectors over the past decade. Like every dataset, COCO contains subtle errors and imperfections stemming from its annotation procedure. With the advent of high-performing models, we ask whether these errors in COCO are hindering its utility in reliably benchmarking further progress. In search of an answer, we inspect thousands of masks from COCO (2017 version) and uncover different types of errors such as imprecise mask boundaries, non-exhaustively annotated instances, and mislabeled masks. Due to the prevalence of COCO, we choose to correct these errors to maintain continuity with prior research. We develop COCO-ReM (Refined Masks), a cleaner set of annotations with visibly better mask quality than COCO-2017. We evaluate fifty object detectors and find that models that predict visually sharper masks score higher on COCO-ReM, affirming that they were being incorrectly penalized due to errors in COCO-2017. Moreover, our models trained using COCO-ReM converge faster and score higher than their larger variants trained using COCO-2017, highlighting the importance of data quality in improving object detectors. With these findings, we advocate using COCO-ReM for future object detection research. Our dataset is available at https://cocorem.xyz

Paper Structure (39 sections, 11 figures, 3 tables)

Introduction
Contributions.
Background: Revisiting COCO Masks
COCO annotation pipeline.
Imperfections in COCO-2017 masks.
COCO-ReM: COCO with Refined Masks
Annotation Pipeline
Stage 1. Mask boundary refinement.
Stage 2. Exhaustive instance annotation.
Stage 3. Correction of labeling errors.
Mask Characteristics
Instance statistics.
Mask IoU.
Holes in masks.
Experiments
...and 24 more sections

Figures (11)

Figure 1: COCO with Refined Masks (COCO-ReM). Top: COCO-ReM improves upon the quality of COCO-2017 masks by handling occlusions consistently (dining table) and annotating instances exhaustively (banana). Bottom: Random examples from COCO-ReM showing the mask quality (single category per image, for clarity).
Figure 2: Imperfections in COCO-2017 masks. Masks generally have coarse boundaries. Moreover, they lack holes when required (scissors) and do not handle occlusions consistently (TV with or without figurines). Annotations are sometimes non-exhaustive: either masks are missing (apples and donuts) or multiple instances are grouped into a single mask (banana and dog).
Figure 3: COCO-ReM annotation pipeline. We develop a semi-automatic annotation pipeline comprising three stages to rectify the imperfections in COCO-2017 masks.
Figure 4: Left: Mask IoU distribution. Distribution of IoU between masks in the COCO-ReM validation set and their source dataset. Right: Masks with holes. Nearly 2000 masks in the COCO-ReM validation set have holes that were missing in COCO-2017.
Figure 5: Evaluating fifty object detectors using COCO-ReM. All models score higher on COCO-ReM than on COCO-2017. Query-based models ($\bigstar$) score much higher than region-based models on COCO-ReM, the opposite of the trend on COCO-2017.
...and 6 more figures
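
The two quantities in the Figure 4 caption, mask IoU and the presence of holes, can be illustrated with a small sketch. This is not the authors' tooling: the function names `mask_iou` and `has_hole` are hypothetical, masks are assumed to be equal-sized binary grids, and a hole is assumed to mean a background region that is not 4-connected to the image border.

```python
from collections import deque

def mask_iou(a, b):
    """IoU of two binary masks given as equal-sized lists of 0/1 rows."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0

def has_hole(mask):
    """True if some background pixel cannot reach the border (assumed 4-connectivity)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with every background pixel on the image border.
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and not mask[y][x]:
                seen[y][x] = True
                queue.append((y, x))
    # Spread through background pixels reachable from the border.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny][nx] and not seen[ny][nx]:
                seen[ny][nx] = True
                queue.append((ny, nx))
    # Any background pixel left unvisited is enclosed by the mask.
    return any(not mask[y][x] and not seen[y][x] for y in range(h) for x in range(w))

# Toy example: a ring-shaped mask versus the same mask with the hole filled in.
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = [row[:] for row in ring]
filled[2][2] = 1

print(mask_iou(ring, filled))            # 8 shared pixels out of 9 in the union
print(has_hole(ring), has_hole(filled))  # True False
```

In the toy example, a mask that fills in the ring's center still overlaps the ring in 8 of 9 pixels, which shows how a missing hole can leave the IoU fairly high while the mask is visibly wrong.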
