Per-Pixel Classification is Not All You Need for Semantic Segmentation

Bowen Cheng, Alexander G. Schwing, Alexander Kirillov

TL;DR

The paper argues that per-pixel classification is not essential for semantic segmentation and introduces MaskFormer, a mask-classification model that handles semantic- and instance-level segmentation with a single architecture, loss, and training procedure. It uses DETR-style set prediction with a Transformer decoder to generate a set of binary masks, each with an associated class prediction. Experiments on ADE20K, COCO, Cityscapes, and Mapillary Vistas show state-of-the-art results (55.6 mIoU on ADE20K, 52.7 PQ on COCO) and highlight the advantages of mask classification in parameter efficiency and computation, especially as the number of classes grows. The work shows that a box-free, mask-centric formulation can surpass traditional per-pixel approaches while simplifying the segmentation landscape.

Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

Paper Structure

This paper contains 21 sections, 1 equation, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Per-pixel classification vs. mask classification. (left) Semantic segmentation with per-pixel classification applies the same classification loss to each location. (right) Mask classification predicts a set of binary masks and assigns a single class to each mask. Each prediction is supervised with a per-pixel binary mask loss and a classification loss. Matching between the set of predictions and ground truth segments can be done either via bipartite matching, as in DETR, or by fixed matching via direct indexing if the number of predictions and classes match, i.e., if $N = K$.
  • Figure 2: MaskFormer overview. We use a backbone to extract image features $\mathcal{F}$. A pixel decoder gradually upsamples image features to extract per-pixel embeddings $\mathcal{E}_\text{pixel}$. A transformer decoder attends to image features and produces $N$ per-segment embeddings $\mathcal{Q}$. The embeddings independently generate $N$ class predictions with $N$ corresponding mask embeddings $\mathcal{E}_\text{mask}$. Then, the model predicts $N$ possibly overlapping binary mask predictions via a dot product between pixel embeddings $\mathcal{E}_\text{pixel}$ and mask embeddings $\mathcal{E}_\text{mask}$ followed by a sigmoid activation. For the semantic segmentation task, we can get the final prediction by combining $N$ binary masks with their class predictions using a simple matrix multiplication (see Section \ref{sec:method:inference}). Note, the dimensions for multiplication $\bigotimes$ are shown in gray.
  • Figure :
  • Figure :
  • Figure I: Visualization of "semantic" queries and "panoptic" queries. Unlike the behavior in a MaskFormer model trained for panoptic segmentation (right), a single query is used to capture multiple instances in a MaskFormer model trained for semantic segmentation (left). Our model has the capacity to adapt to different types of tasks given different ground truth annotations.
  • ...and 1 more figure
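
The mask prediction and semantic inference steps described in the Figure 2 caption can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names, the flattened pixel layout (P = H*W), the toy values, and the extra "no object" class column (a DETR-style convention assumed here) are all illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_masks(mask_emb, pixel_emb):
    """Dot product of each mask embedding (N x C) with each pixel
    embedding (P x C), followed by a sigmoid: N x P mask probabilities."""
    return [
        [sigmoid(sum(m * e for m, e in zip(mvec, pvec))) for pvec in pixel_emb]
        for mvec in mask_emb
    ]


def semantic_inference(class_logits, mask_probs):
    """Combine N class predictions with N masks by a matrix product.

    class_logits is N x (K + 1); the last column is an assumed
    "no object" class that is dropped before combining.
    Returns one class index per pixel.
    """
    class_probs = [softmax(row)[:-1] for row in class_logits]
    num_queries = len(mask_probs)
    num_classes = len(class_probs[0])
    num_pixels = len(mask_probs[0])
    labels = []
    for p in range(num_pixels):
        # score[c] = sum over queries i of class_probs[i][c] * mask_probs[i][p]
        scores = [
            sum(class_probs[i][c] * mask_probs[i][p] for i in range(num_queries))
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Toy example: N = 2 queries, C = 2 embedding channels, P = 3 pixels, K = 2 classes.
mask_emb = [[4.0, 0.0], [0.0, 4.0]]
pixel_emb = [[1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]]
class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]

mask_probs = predict_masks(mask_emb, pixel_emb)
labels = semantic_inference(class_logits, mask_probs)
print(labels)  # -> [0, 1, 0]
```

In this toy setup, query 0 produces a mask covering pixels 0 and 2 and predicts class 0, while query 1 covers pixel 1 and predicts class 1, so the combined per-pixel labels follow directly. A practical implementation would express both steps as batched tensor operations rather than nested loops.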