Semantic Understanding of Scenes through the ADE20K Dataset

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, Antonio Torralba

TL;DR

The paper introduces ADE20K, a densely annotated dataset for pixel-level scene understanding that includes stuff, objects, and object parts, annotated by a single expert to ensure consistency. It introduces the Cascade Segmentation Module to parse scenes in a cascade, yielding improvements in scene parsing when integrated into baseline networks. Two benchmarks, SceneParse150 for scene parsing and InstSeg100 for instance segmentation, are established, with reproducible re-implementations of state-of-the-art models, analysis of batch normalization effects, and insights from the Places Challenges. The work also demonstrates practical applications such as hierarchical semantic segmentation, automatic content removal, and scene synthesis, and releases the dataset and pretrained models to the community.

Abstract

Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A generic network design called Cascade Segmentation Module is then proposed to enable the segmentation networks to parse a scene into stuff, objects, and object parts in a cascade. We evaluate the proposed module integrated within two existing semantic segmentation networks, yielding significant improvements for scene parsing. We further show that the scene parsing networks trained on ADE20K can be applied to a wide variety of scenes and objects.

Paper Structure

This paper contains 22 sections, 21 figures, 9 tables.

Figures (21)

  • Figure 1: Images in the ADE20K dataset are densely annotated in detail with objects and parts. The first row shows sample images, the second row shows the object annotations, and the third row shows the object-part annotations. The color scheme encodes both object categories and object instances: different object categories differ widely in color, while different instances of the same category differ only slightly (e.g., the person instances in the first image have slightly different colors).
  • Figure 2: Annotation interface, showing the list of objects and their associated parts in the image.
  • Figure 3: Section of the relation tree of objects and parts for the dataset. Each number indicates the number of instances for each object. The full relation tree is available at the dataset webpage.
  • Figure 4: Analysis of annotation consistency. Each column shows an image and two segmentations done by the same annotator at different times. Bottom row shows the pixel discrepancy when the two segmentations are subtracted, while the number at the bottom shows the percentage of pixels with the same label. On average across all re-annotated images, $82.4\%$ of pixels got the same label. In the example in the first column the percentage of pixels with the same label is relatively low because the annotator labeled the same region as 'snow' and 'ground' during the two rounds of annotation. In the third column, there were many objects in the scene and the annotator missed some between the two segmentations.
  • Figure 5: a) Object classes sorted by frequency. Only the top 270 classes with more than 100 annotated instances are shown. 68 classes have more than 1,000 segmented instances. b) Frequency of parts grouped by objects. There are more than 200 object classes with annotated parts. Only objects with 5 or more parts are shown in this plot (we show at most 7 parts for each object class). c) Objects ranked by the number of scenes they are part of. d) Object parts ranked by the number of objects they are part of. e) Examples of objects with doors. The bottom-right image is an example where the door does not behave as a part.
  • ...and 16 more figures
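
The consistency measure described in the Figure 4 caption (the percentage of pixels that receive the same label in two annotation rounds) can be illustrated with a small sketch. This is only an assumed reading of the caption, not the authors' evaluation code: the function names `pixel_agreement` and `discrepancy_mask`, the integer label ids, and the toy label maps are all hypothetical, and the caption does not say how the per-image values are averaged.

```python
from typing import List, Sequence

# A label map is a 2D grid of integer class ids (hypothetical encoding).
LabelMap = Sequence[Sequence[int]]


def _check_same_shape(seg_a: LabelMap, seg_b: LabelMap) -> None:
    """Raise ValueError unless both label maps have identical dimensions."""
    if len(seg_a) != len(seg_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(seg_a, seg_b)
    ):
        raise ValueError("segmentations must have the same shape")


def pixel_agreement(seg_a: LabelMap, seg_b: LabelMap) -> float:
    """Return the percentage of pixels with the same label in both maps."""
    _check_same_shape(seg_a, seg_b)
    total = sum(len(row) for row in seg_a)
    if total == 0:
        raise ValueError("segmentations must not be empty")
    same = sum(
        a == b
        for row_a, row_b in zip(seg_a, seg_b)
        for a, b in zip(row_a, row_b)
    )
    return 100.0 * same / total


def discrepancy_mask(seg_a: LabelMap, seg_b: LabelMap) -> List[List[bool]]:
    """Return a grid that is True wherever the two label maps disagree."""
    _check_same_shape(seg_a, seg_b)
    return [
        [a != b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(seg_a, seg_b)
    ]


# Toy 3x3 example: two annotation rounds that disagree on two pixels.
first_round = [[1, 1, 2], [1, 3, 2], [4, 4, 2]]
second_round = [[1, 1, 2], [1, 3, 3], [4, 5, 2]]

agreement = pixel_agreement(first_round, second_round)
mask = discrepancy_mask(first_round, second_round)
print(f"{agreement:.1f}% of pixels agree")
```

In this toy case 7 of the 9 pixels match. The mask plays the role of the discrepancy row in the figure, marking the pixels where the two rounds differ.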