SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation

Duy-Kien Nguyen, Martin R. Oswald, Cees G. M. Snoek

TL;DR

The paper tackles the reliance on multi-scale feature pyramids in dense vision tasks by introducing SimPLR, a plain detector that shifts scale information into the attention mechanism. It employs a plain ViT backbone with MAE pretraining, a simple projection to feed a single-scale encoder-decoder, and scale-aware attention with fixed and adaptive variants using multiple anchors. Experiments on COCO and Cityscapes demonstrate competitive object detection and segmentation performance with notably faster inference compared to hierarchical and multi-scale baselines, and show favorable scaling behavior with larger backbones and more pretraining data. The work suggests that many handcrafted multi-scale design choices in CNN-based detectors can be replaced by learnable scale-aware attention in transformer-based architectures.

Abstract

The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor for their empirical success. In this paper, we show that shifting the multi-scale inductive bias into the attention mechanism can work well, resulting in a plain detector 'SimPLR' whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state of the art, our model scales better with bigger-capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation, and panoptic segmentation. Code is released at https://github.com/kienduynguyen/SimPLR.

Paper Structure (10 sections, 4 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Object detection architectures. Left: The plain-backbone detector from li2022vitdet, whose inputs (denoted in the dashed region) are multi-scale features. Middle: State-of-the-art end-to-end detectors (nguyen2022boxer, li2023maskdino, hu2023dacdetr) utilize a hierarchical backbone (i.e., Swin liu2021swintransformer) to create multi-scale inputs. Right: Our simple single-scale detector following the end-to-end framework. Where existing detectors require feature pyramids to be effective, we propose a plain detector, SimPLR, whose backbone and detection head are non-hierarchical and operate on a single-scale feature map. The plain detector, SimPLR, performs on par with or even better than hierarchical and/or multi-scale counterparts, while also being more scaling-efficient.
  • Figure 2: Illustration of the proposed adaptive-scale attention. In the $i$-th attention head, given a query vector and anchors of three scales, the adaptive-scale attention learns to attend to a region of interest w.r.t. each scale. It then generates attention weights on these scales adaptively based on the query vector to produce $\mathrm{head}_i$. As this mechanism allows each vector in our plain feature map to learn suitable scale information during the training process, we no longer need a hierarchical backbone along with feature pyramids.
  • Figure 3: Masked Instance-Attention. Left: Box-attention nguyen2022boxer samples $2\times2$ grid features in the region of interest. Right: Proposed masked instance-attention for dense grid sampling. The $2\times2$ attention scores are denoted in four colours and the masked attention scores are in white. The masked instance-attention preserves the efficiency by sparse sampling while better capturing objects of different shapes.
  • Figure 4: The creation of input features. Left: The creation of feature pyramids from the last feature of the plain backbone, ViT, in SimpleFPN li2022vitdet where different stacks of convolutional layers are used to create features at different scales. Right: The design of our single-scale feature map with only one layer.
  • Figure 5: Scaling comparison. We compare our plain detector, SimPLR, with recent multi-scale detectors, including both end-to-end detectors like Mask2Former and plain-backbone detectors like ViTDet. Larger circles indicate bigger models (Base, Large, Huge). Because of its plain and simple architecture, SimPLR's performance scales with size. Our biggest model achieves stronger performance and faster runtime than all its multi-scale counterparts.
  • ...and 2 more figures
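
The adaptive-scale attention described in the Figure 2 caption can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names (`pool_region`, `adaptive_scale_head`), the fixed square windows, the average pooling, and the `scale_proj` matrix are all assumptions made for illustration. In the paper the region of interest for each scale is learned, whereas here each anchor scale is a fixed window around the query location. The sketch shows only the core idea that one query mixes features from several scales using weights computed from the query itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pool_region(feature_map, center, half_size):
    """Average the feature vectors in a square window, clipped to the map.

    Assumption: average pooling stands in for the learned region sampling.
    """
    h, w = len(feature_map), len(feature_map[0])
    cy, cx = center
    rows = range(max(0, cy - half_size), min(h, cy + half_size + 1))
    cols = range(max(0, cx - half_size), min(w, cx + half_size + 1))
    cells = [feature_map[r][c] for r in rows for c in cols]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def adaptive_scale_head(query, feature_map, center, half_sizes, scale_proj):
    """One attention head that mixes per-scale features with query-dependent weights."""
    # One pooled feature per anchor scale (hypothetical fixed windows).
    per_scale = [pool_region(feature_map, center, s) for s in half_sizes]
    # Scale logits are a linear function of the query: one row of scale_proj per scale.
    logits = [sum(w * q for w, q in zip(row, query)) for row in scale_proj]
    weights = softmax(logits)
    dim = len(per_scale[0])
    head = [sum(wt * feat[d] for wt, feat in zip(weights, per_scale))
            for d in range(dim)]
    return head, weights


# Toy single-scale feature map: a 5x5 grid whose features are the (row, col) coordinates.
feature_map = [[[float(r), float(c)] for c in range(5)] for r in range(5)]
query = [1.0, -1.0]
scale_proj = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]  # hypothetical "learned" weights
head, weights = adaptive_scale_head(query, feature_map, (2, 2), (0, 1, 2), scale_proj)
```

Because the scale weights come from a softmax over a projection of the query, each location in the single-scale feature map can favour a different anchor scale. This is the role the caption assigns to adaptive-scale attention in place of a feature pyramid.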