LVIS: A Dataset for Large Vocabulary Instance Segmentation

Agrim Gupta, Piotr Dollár, Ross Girshick

TL;DR

LVIS introduces a large vocabulary instance segmentation benchmark to address the long-tail regime, where many object categories have only a few training examples. The authors propose an evaluation-first design using a federated dataset, enabling exhaustive per-category annotations while dramatically reducing labeling workload. They demonstrate high annotation quality, analyze dataset statistics, and validate that COCO-style detectors transfer reasonably to LVIS, while highlighting the pronounced challenges of low-shot categories. The work lays groundwork for developing segmentation methods that scale beyond hundreds of categories and for low-shot learning in vision tasks, with a public LVIS release and challenges to spur progress.

Abstract

Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced 'el-vis'): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect ~2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.
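
To make the long-tail claim concrete, the following minimal sketch splits a fixed number of instances across categories under an idealized Zipf law, where the frequency of a category is proportional to 1 / rank. The exponent of 1, the threshold of 500, and the helper names `zipf_counts` and `count_rare` are illustrative assumptions, not values or code from the paper; the category and mask totals are the planned figures quoted in the abstract. The real LVIS category distribution may differ considerably.

```python
from typing import List


def zipf_counts(num_categories: int, total_instances: int,
                exponent: float = 1.0) -> List[int]:
    """Split total_instances across categories with weight 1 / rank**exponent.

    Idealized model only: the exponent is an assumption, not a fitted value.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_categories + 1)]
    norm = sum(weights)
    return [round(total_instances * w / norm) for w in weights]


def count_rare(counts: List[int], threshold: int) -> int:
    """Count categories with fewer than `threshold` instances."""
    return sum(1 for c in counts if c < threshold)


# Planned scale from the abstract: over 1000 categories, ~2 million masks.
counts = zipf_counts(num_categories=1000, total_instances=2_000_000)
rare = count_rare(counts, threshold=500)  # threshold chosen arbitrarily

print("most frequent category:", counts[0])
print("least frequent category:", counts[-1])
print("categories below threshold:", rare)
```

Under this toy model the most frequent category receives orders of magnitude more instances than the least frequent one, which is the regime the abstract identifies as difficult for current detectors.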

Paper Structure

This paper contains 53 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Example annotations. We present LVIS, a new dataset for benchmarking Large Vocabulary Instance Segmentation in the 1000+ category regime with a challenging long tail of rare objects.
  • Figure 2: Category relationships from left to right: non-disjoint category pairs may be in partially overlapping, parent-child, or equivalent (synonym) relationships, implying that a single object may have multiple valid labels. The fair evaluation of an object detector must take the issue of multiple valid labels into account.
  • Figure 3: Example LVIS annotations (one category per image for clarity). See http://www.lvisdataset.org/explore.
  • Figure 4: Our annotation pipeline comprises six stages. Stage 1: Object Spotting elicits annotators to mark a single instance of many different categories per image. This stage is iterative and causes annotators to discover a long tail of categories. Stage 2: Exhaustive Instance Marking extends the stage 1 annotations to cover all instances of each spotted category. Here we show additional instances of book. Stages 3 and 4: Instance Segmentation and Verification are repeated back and forth until $\sim$99% of all segmentations pass a quality check. Stage 5: Exhaustive Annotations Verification checks that all instances are in fact segmented and flags categories that are missing one or more instances. Stage 6: Negative Labels are assigned by verifying that a subset of categories do not appear in the image.
  • Figure 5: Distribution of object centers in normalized image coordinates for four datasets. ADE20K exhibits the greatest spatial diversity, with LVIS achieving greater complexity than COCO and the Open Images v4 training set.
  • ...and 8 more figures
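
The per-image labels produced by Stages 5 and 6 of the pipeline in Figure 4 can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the dataset's actual file format or API: the `ImageLabels` class, its field names, and the `images_for_category` helper are hypothetical. It also assumes that a category is scored only on images where its presence or absence was verified, which is one reading of the federated design mentioned in the TL;DR rather than something this excerpt spells out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ImageLabels:
    """Hypothetical per-image record of category-level labels."""
    # Categories with at least one segmented instance in the image.
    positive: Set[str] = field(default_factory=set)
    # Positive categories flagged in Stage 5 as missing one or more instances.
    not_exhaustive: Set[str] = field(default_factory=set)
    # Categories verified in Stage 6 as absent from the image.
    negative: Set[str] = field(default_factory=set)


def images_for_category(labels: Dict[str, ImageLabels], category: str) -> List[str]:
    """Return ids of images where `category` has a verified positive or negative label.

    Images with no label for the category are skipped, since nothing is
    known about whether the category appears there.
    """
    return sorted(
        image_id
        for image_id, record in labels.items()
        if category in record.positive or category in record.negative
    )


# Toy example with made-up image ids and categories.
labels = {
    "img1": ImageLabels(positive={"book", "cup"}, not_exhaustive={"book"}),
    "img2": ImageLabels(positive={"cup"}),
    "img3": ImageLabels(positive={"dog"}, negative={"book"}),
}

book_images = images_for_category(labels, "book")
print(book_images)
```

In this toy example `img2` carries no label for book, so it is left out when that category is considered; only images with a verified positive or negative label contribute.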