Per-Pixel Classification is Not All You Need for Semantic Segmentation
Bowen Cheng, Alexander G. Schwing, Alexander Kirillov
TL;DR
The paper argues that per-pixel classification is not essential for semantic segmentation and introduces MaskFormer, a mask-classification model that handles semantic- and instance-level segmentation with a single architecture, loss, and training procedure. It uses DETR-style set prediction with a Transformer decoder to generate a set of binary masks, each paired with class probabilities, which yields strong results on large-vocabulary datasets and competitive panoptic results. Experiments on ADE20K, COCO, Cityscapes, and Mapillary Vistas show state-of-the-art results (55.6 mIoU on ADE20K for semantic segmentation, 52.7 PQ on COCO for panoptic segmentation) and indicate that mask classification is efficient in both parameters and computation, especially as the number of classes grows. The work shows that a box-free, mask-centric formulation can surpass traditional per-pixel approaches while simplifying the landscape of segmentation methods.
Abstract
Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
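The abstract describes the model output as a set of binary masks, each with one global class prediction, but does not spell out how such a set becomes a per-pixel label map. The sketch below shows one plausible way to do it, under stated assumptions: each of the N predictions carries K + 1 class logits with the last entry treated as a "no object" label, masks are given as per-pixel logits, and the label at each pixel is the class with the highest sum of (class probability × mask probability) over all predictions. The function name and the toy inputs are hypothetical and only meant to illustrate the idea.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def semantic_map_from_masks(class_logits, mask_logits):
    """Combine N (class, mask) predictions into an H x W label map.

    class_logits: N lists of K + 1 floats; the last entry is ASSUMED
                  to be a "no object" label and is ignored when scoring.
    mask_logits:  N masks, each an H x W nested list of binary logits.
    """
    num_classes = len(class_logits[0]) - 1
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    # Per-prediction class probabilities, with "no object" dropped.
    class_probs = [softmax(row)[:num_classes] for row in class_logits]

    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over predictions of
            # p_i(c) * m_i[y][x].
            scores = [
                sum(p[c] * sigmoid(m[y][x])
                    for p, m in zip(class_probs, mask_logits))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Toy input: 2 predictions, 2 real classes plus "no object", 2 x 4 image.
class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
mask_logits = [
    [[6.0, 6.0, -6.0, -6.0], [6.0, 6.0, -6.0, -6.0]],  # left half
    [[-6.0, -6.0, 6.0, 6.0], [-6.0, -6.0, 6.0, 6.0]],  # right half
]
labels = semantic_map_from_masks(class_logits, mask_logits)
print(labels)  # [[0, 0, 1, 1], [0, 0, 1, 1]]
```

Note that the class prediction is made once per mask rather than once per pixel, so in this formulation the cost of the classification step scales with the number of predicted masks rather than with image resolution.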
