COCONut: Modernizing COCO Segmentation

Xueqing Deng, Qihang Yu, Peng Wang, Xiaohui Shen, Liang-Chieh Chen

TL;DR

This work addresses the limitations of COCO segmentation by introducing COCONut, a large-scale, human-verified universal segmentation dataset that harmonizes semantic, instance, and panoptic annotations across 133 classes. It presents an assisted-manual annotation pipeline and a data engine that scale the training set from 118K to 358K images; together with a high-quality 25K-image validation set, the dataset totals 383K images and 5.18M masks. Analyses show improved annotation quality and benchmarking stability, while pseudo-labels offer limited gains compared to fully human-labeled data. The dataset enables more reliable evaluation and training for modern segmentation models: larger, high-quality, human-annotated data improves performance across tasks and backbones, and the more challenging COCONut-val improves model assessment.

Abstract

In recent decades, the vision community has witnessed remarkable progress in visual recognition, partially owing to advancements in dataset benchmarks. Notably, the established COCO benchmark has propelled the development of modern detection and segmentation systems. However, the COCO segmentation benchmark has seen comparatively slow improvement over the last decade. Originally equipped with coarse polygon annotations for thing instances, it gradually incorporated coarse superpixel annotations for stuff regions, which were subsequently heuristically amalgamated to yield panoptic segmentation annotations. These annotations, executed by different groups of raters, have resulted not only in coarse segmentation masks but also in inconsistencies between segmentation types. In this study, we undertake a comprehensive reevaluation of the COCO segmentation annotations. By enhancing the annotation quality and expanding the dataset to encompass 383K images with more than 5.18M panoptic masks, we introduce COCONut, the COCO Next Universal segmenTation dataset. COCONut harmonizes segmentation annotations across semantic, instance, and panoptic segmentation with meticulously crafted high-quality masks, and establishes a robust benchmark for all segmentation tasks. To our knowledge, COCONut stands as the inaugural large-scale universal segmentation dataset, verified by human raters. We anticipate that the release of COCONut will significantly contribute to the community's ability to assess the progress of novel neural networks.
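
One way to picture what "harmonized" annotations mean: if the semantic and instance views are both derived from a single panoptic labeling, they cannot disagree with each other. The sketch below is only an illustration under that assumption, not the authors' tooling or the dataset's file format. The function name `split_panoptic`, the toy category ids, and the dictionary fields (loosely modeled on COCO-style panoptic segment lists) are all hypothetical.

```python
import numpy as np

def split_panoptic(segment_ids, segments, thing_classes):
    """Derive a semantic map and per-instance masks from one panoptic labeling.

    segment_ids   : 2D integer array, one segment id per pixel (0 = unlabeled here)
    segments      : list of dicts with hypothetical keys "id" and "category_id"
    thing_classes : set of category ids treated as countable 'thing' classes
    """
    semantic = np.zeros_like(segment_ids)
    instances = []
    for seg in segments:
        mask = segment_ids == seg["id"]
        # Every pixel of the segment receives the segment's class label.
        semantic[mask] = seg["category_id"]
        # Only 'thing' segments become separate instance masks.
        if seg["category_id"] in thing_classes:
            instances.append((seg["category_id"], mask))
    return semantic, instances

# Toy 3x3 image: two 'thing' segments of class 1 and one 'stuff' segment of class 7.
segment_ids = np.array([[1, 1, 2],
                        [3, 3, 2],
                        [3, 3, 0]])
segments = [
    {"id": 1, "category_id": 1},
    {"id": 2, "category_id": 1},
    {"id": 3, "category_id": 7},
]
semantic, instances = split_panoptic(segment_ids, segments, thing_classes={1})
```

In this toy case the two class-1 segments merge into a single region in `semantic` while remaining separate entries in `instances`, so the three views stay consistent by construction.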

Paper Structure (18 sections, 16 figures, 17 tables)

Figures (16)

  • Figure 1: Overview of COCONut, the COCO Next Universal segmenTation dataset. Top: COCONut, comprising images from COCO and Objects365, constitutes a diverse collection annotated with high-quality masks and semantic classes. Bottom: COCONut empowers a multitude of image understanding tasks.
  • Figure 2: Annotation Comparison: We delineate erroneous annotations from COCO using yellow dotted line boxes, juxtaposed with our COCONut corrected annotations. Common COCO annotation errors include over-annotations (e.g., 'person crowd' erroneously extends into 'playingfield'), incomplete mask fragments (e.g., 'table-merged' and 'blanket' are annotated in small isolated segments), missing annotations (e.g., 'tree-merged' remains unannotated), coarse segmentations (especially noticeable in 'stuff' regions annotated by superpixels and in 'thing' regions by loose polygons), and wrong semantic categories (e.g., 'tree-merged' is incorrectly tagged as 'dirt-merged').
  • Figure 3: Overview of the Proposed Assisted-Manual Annotation Pipeline: To streamline the labor-intensive labeling task, our annotation pipeline encompasses four pivotal stages: (1) machine-generated pseudo labels, (2) human inspection and editing, (3) mask generation or refinement, and (4) quality verification. Acknowledging the inherent distinctions between 'thing' and 'stuff' classes, we systematically address these intricacies at each stage. Stage 1: Machines are employed to generate box and mask proposals for 'thing' and 'stuff', respectively. Stage 2: Raters assess the proposal qualities using a meticulously crafted questionnaire. For proposals falling short of requirements, raters can update them by editing boxes or adding positive/negative points for 'thing' and 'stuff', respectively. Stage 3: We utilize Box2Mask and Point2Mask modules to generate masks based on the inputs from stage 2. Stage 4: Experts perform a comprehensive verification of annotation quality, with relabeling done if the quality falls below our stringent standards.
  • Figure 4: Mask Prediction Comparison: In contrast to kMaX-DeepLab, box-kMaX (Box2Mask module) leverages box queries, initialized with features pooled from the backbone within the box regions, enabling more accurate segmentation of 'thing' objects. Notably, kMaX-DeepLab falls short in capturing the challenging 'baseball bat' and the heavily occluded 'person' in the figure.
  • Figure 5: Annotation Comparison: We show annotations obtained by COCO, COCONut (Box2Mask for 'thing' in (a) or Point2Mask for 'stuff' in (b)), and our expert rater. COCONut's annotation exhibits sharper boundaries, closely resembling expert results, as evident from higher IoU values. The blue and red regions correspond to extra and missing regions, respectively, compared to the expert mask.
  • ...and 11 more figures
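
The comparison described in the Figure 5 caption (IoU against an expert mask, with extra and missing regions) can be sketched with a few array operations. This is a minimal illustration assuming binary masks of equal shape; the function name `compare_to_expert` and the toy masks are hypothetical and do not come from the paper.

```python
import numpy as np

def compare_to_expert(pred, expert):
    """Compare a binary mask with an expert mask of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    expert = np.asarray(expert, dtype=bool)
    intersection = np.logical_and(pred, expert).sum()
    union = np.logical_or(pred, expert).sum()
    # Extra region: pixels annotated in pred but absent from the expert mask.
    extra = np.logical_and(pred, ~expert).sum()
    # Missing region: pixels in the expert mask that pred failed to cover.
    missing = np.logical_and(~pred, expert).sum()
    # Two empty masks agree perfectly, so define IoU as 1.0 in that case.
    iou = float(intersection) / float(union) if union > 0 else 1.0
    return {"iou": iou, "extra": int(extra), "missing": int(missing)}

# Toy example: the expert mask is a 2x2 square, and the loose mask
# covers the same square plus one extra row of two pixels above it.
expert = np.zeros((4, 4), dtype=bool)
expert[1:3, 1:3] = True
loose = np.zeros((4, 4), dtype=bool)
loose[0:3, 1:3] = True

result = compare_to_expert(loose, expert)
```

Here the loose mask overlaps the expert mask on 4 pixels out of a union of 6, giving an IoU of about 0.67 with 2 extra pixels and 0 missing pixels. A tighter mask would shrink the extra region and raise the IoU, which is the pattern the caption reports for COCONut relative to COCO.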