Semantic Understanding of Scenes through the ADE20K Dataset
Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, Antonio Torralba
TL;DR
The paper introduces ADE20K, a densely annotated dataset for pixel-level scene understanding that covers stuff, objects, and object parts, annotated by a single expert to ensure consistency. It also proposes the Cascade Segmentation Module, which parses a scene into stuff, objects, and object parts in a cascade and improves scene parsing when integrated into baseline networks. Two benchmarks are established, SceneParse150 for scene parsing and InstSeg100 for instance segmentation, together with reproducible re-implementations of state-of-the-art models, an analysis of batch normalization effects, and insights from the Places Challenges. The work also demonstrates practical applications such as hierarchical semantic segmentation, automatic content removal, and scene synthesis, and releases the dataset and pretrained models to the community.
Abstract
Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A generic network design called Cascade Segmentation Module is then proposed to enable the segmentation networks to parse a scene into stuff, objects, and object parts in a cascade. We evaluate the proposed module integrated within two existing semantic segmentation networks, yielding significant improvements for scene parsing. We further show that the scene parsing networks trained on ADE20K can be applied to a wide variety of scenes and objects.
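To make the cascade idea concrete, the following is a minimal sketch of one plausible way the outputs of a stuff stage and an object stage could be merged into a single label map. It is an illustration under stated assumptions, not the authors' implementation: the function name merge_cascade, the OBJECT_PLACEHOLDER value, and the integer label ids are all hypothetical, and the part stage is omitted for brevity.

```python
from typing import List

LabelMap = List[List[int]]

# Hypothetical convention for this sketch: the stuff stage outputs this value
# wherever it believes "some object" (rather than stuff) occupies the pixel.
OBJECT_PLACEHOLDER = -1


def merge_cascade(stuff_map: LabelMap, object_map: LabelMap) -> LabelMap:
    """Combine two per-pixel label maps of the same shape.

    Pixels that the stuff stage marked with OBJECT_PLACEHOLDER take the
    object stage's label; all other pixels keep the stuff label.
    """
    merged: LabelMap = []
    for stuff_row, object_row in zip(stuff_map, object_map):
        merged.append([
            obj if stuff == OBJECT_PLACEHOLDER else stuff
            for stuff, obj in zip(stuff_row, object_row)
        ])
    return merged


if __name__ == "__main__":
    # Toy 2x3 image: labels 0 and 1 stand in for stuff classes,
    # labels 5, 6, 7 stand in for object classes (all made up).
    stuff = [[0, 0, -1],
             [1, -1, -1]]
    objects = [[7, 7, 5],
               [7, 6, 5]]
    print(merge_cascade(stuff, objects))
```

In this toy example, the object stage's predictions are used only where the stuff stage defers to it, so its output on stuff pixels (the 7s above) is ignored. A part stage could be chained in the same way, refining pixels that belong to objects with part annotations.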
