SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation
Duy-Kien Nguyen, Martin R. Oswald, Cees G. M. Snoek
TL;DR
The paper tackles the reliance on multi-scale feature pyramids in dense vision tasks by introducing SimPLR, a plain detector that shifts scale information into the attention mechanism. It employs a plain ViT backbone with MAE pretraining, a simple projection that feeds a single-scale encoder-decoder, and scale-aware attention with fixed and adaptive variants using multiple anchors. Experiments on COCO and Cityscapes demonstrate competitive object detection and segmentation performance with notably faster inference than hierarchical and multi-scale baselines, and show favorable scaling behavior with larger backbones and more pretraining data. The work suggests that many handcrafted multi-scale design choices in CNN-based detectors can be replaced by learnable scale-aware attention in transformer-based architectures.
Abstract
The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor for their empirical success. In this paper, we show that shifting the multi-scale inductive bias into the attention mechanism can work well, resulting in a plain detector, "SimPLR", whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales better with larger-capacity (self-supervised) models and more pre-training data, allowing us to report consistently better accuracy and faster runtime for object detection, instance segmentation, and panoptic segmentation. Code is released at https://github.com/kienduynguyen/SimPLR.
