PEM: Prototype-based Efficient MaskFormer for Image Segmentation

Niccolò Cavagnero, Gabriele Rosi, Claudia Cuttano, Francesca Pistilli, Marco Ciccone, Giuseppe Averta, Fabio Cermelli

TL;DR

PEM tackles the efficiency bottleneck of transformer-based image segmentation with two components: a prototype-based cross-attention that reduces the tokens involved in attention from $HW$ pixel features to $N$ object prototypes, and a fully convolutional, context-modulated multi-scale pixel decoder. The approach preserves or improves performance on semantic and panoptic segmentation across Cityscapes and ADE20K while delivering faster inference than strong baselines. Key contributions include the prototype selection mechanism, a lightweight cross-attention formulation, and an efficient FPN-based decoder with context modulation and deformable convolutions. The results show a favorable accuracy-speed trade-off, enabling practical deployment in resource-constrained settings without sacrificing cross-task versatility.

Abstract

Recent transformer-based architectures have shown impressive results in the field of image segmentation. Thanks to their flexibility, they obtain outstanding performance in multiple segmentation tasks, such as semantic and panoptic, under a single unified framework. To achieve such impressive performance, these architectures employ intensive operations and require substantial computational resources, which are often not available, especially on edge devices. To fill this gap, we propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate in multiple segmentation tasks. PEM proposes a novel prototype-based cross-attention which leverages the redundancy of visual features to restrict the computation and improve the efficiency without harming the performance. In addition, PEM introduces an efficient multi-scale feature pyramid network, capable of extracting features with high semantic content at low cost, thanks to the combination of deformable convolutions and context-based self-modulation. We benchmark the proposed PEM architecture on two tasks, semantic and panoptic segmentation, evaluated on two different datasets, Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset, outperforming task-specific architectures while being comparable to, or even better than, computationally expensive baselines.
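
The idea of exploiting feature redundancy can be illustrated with a toy sketch. It assumes that each of the $N$ queries keeps only the single pixel feature most similar to it (by dot product) as its prototype, so that later attention works on $N$ tokens instead of $HW$. The selection rule and the names `dot` and `select_prototypes` are hypothetical choices made for illustration; the paper's exact formulation may differ.

```python
def dot(a, b):
    """Dot product of two equally sized vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def select_prototypes(queries, pixel_features):
    """Pick one prototype per query (assumed rule: highest dot-product similarity).

    queries:        N vectors of size C
    pixel_features: HW vectors of size C
    returns:        N vectors of size C, one selected pixel feature per query
    """
    prototypes = []
    for query in queries:
        scores = [dot(query, feature) for feature in pixel_features]
        # Index of the pixel feature that is most similar to this query.
        best = max(range(len(scores)), key=scores.__getitem__)
        prototypes.append(pixel_features[best])
    return prototypes


# Toy input: N = 2 queries, HW = 3 pixel features, C = 2 channels.
queries = [[1.0, 0.0], [0.0, 1.0]]
pixel_features = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

prototypes = select_prototypes(queries, pixel_features)
print(len(pixel_features), "pixel tokens reduced to", len(prototypes), "prototypes")
```

In this sketch the selection step still scores every pixel once, but everything downstream of it operates on $N$ prototypes rather than $HW$ pixel tokens.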

Paper Structure

This paper contains 16 sections, 12 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: PEM delivers comparable or superior performance in comparison to existing methods while being the fastest multi-task architecture for image segmentation.
  • Figure 2: Architecture of PEM with the three main components highlighted: backbone, pixel decoder and transformer decoder. The backbone extracts features from the input image; the pixel decoder upsamples them to obtain high-resolution features; the transformer decoder takes as input a set of learnable queries and the high-resolution features and produces refined queries for inference.
  • Figure 3: Scheme of the proposed Prototype-based Masked Cross-Attention. The prototype selection mechanism reduces the number of tokens from $HW$ to $N$, the number of queries, significantly reducing the computational burden.
  • Figure 4: Latency comparison between PEM-CA and Masked Cross-Attention. PEM-CA scales better than the Masked Cross-Attention of Mask2Former as the input dimension increases. Note that, for Cityscapes images (1024$\times$2048 pixels), the feature maps contain 2048 ($F_4$), 8192 ($F_3$), 32768 ($F_2$), and 131072 ($F_1$) pixels.
  • Figure 5: PQ versus latency on Cityscapes and ADE20K. We report performance and latency across different numbers of PEM transformer decoder blocks, ranging from zero to six.
  • ...and 4 more figures
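
The pixel counts quoted in the Figure 4 caption can be reproduced with a few lines of arithmetic. The sketch below assumes that $F_1$ to $F_4$ are feature maps at strides 4, 8, 16 and 32 relative to the input; the caption does not state the strides, but these values are consistent with its numbers. The function name `token_count` is hypothetical.

```python
# Cityscapes input resolution, as stated in the Figure 4 caption.
HEIGHT, WIDTH = 1024, 2048

# Assumption: F_1..F_4 are downsampled by these strides (not stated in the caption).
STRIDES = {"F_1": 4, "F_2": 8, "F_3": 16, "F_4": 32}


def token_count(height, width, stride):
    """Number of pixels (tokens) HW in a feature map at the given stride."""
    return (height // stride) * (width // stride)


counts = {name: token_count(HEIGHT, WIDTH, s) for name, s in STRIDES.items()}

for name, tokens in counts.items():
    print(f"{name}: {tokens} tokens")
```

Under this assumption the token count quadruples at each finer level, which is why an attention whose cost grows with $HW$ becomes expensive on $F_1$, and why reducing the tokens to $N$ prototypes matters most at high resolution.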