SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer
Wenxi Li, Yuchen Guo, Jilai Zheng, Haozhe Lin, Chao Ma, Lu Fang, Xiaokang Yang
TL;DR
This work targets object detection in high-resolution wide (HRW) gigapixel scenes, where extreme sparsity and massive scale variation challenge traditional detectors. It introduces SparseFormer, a sparse Vision Transformer that uses ScoreNet to select informative regions and combines global attention on aggregated features with local attention on sparse windows. Cross-slice Non-Maximum Suppression (C-NMS) and a multi-scale training/inference pipeline address oversized and tiny objects across slices and scales. Experiments on PANDA and DOTA-v1.0 show gains in both accuracy (up to 5.8%) and speed (up to 3x) over prior state-of-the-art approaches, with reduced FLOPs, higher AP, and strong performance on edge devices, indicating practical applicability to HRW-shot detection.
Abstract
Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike the close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, which make existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the object detection gap between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel cross-slice non-maximum suppression (C-NMS) algorithm that precisely localizes objects from noisy windows, and from a simple yet effective multi-scale strategy that improves accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.
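The idea of attending only to windows that may contain objects can be illustrated with a small sketch. This is not the paper's ScoreNet or its selection rule: it assumes a precomputed per-cell objectness map, averages it over non-overlapping windows, and keeps the top-scoring fraction. The names `window_scores`, `select_sparse_windows`, and `keep_ratio` are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple

Window = Tuple[int, int]  # (row, col) index of a window in the window grid


def window_scores(score_map: List[List[float]], win: int) -> Dict[Window, float]:
    """Average a per-cell objectness map over non-overlapping win x win windows."""
    rows, cols = len(score_map), len(score_map[0])
    scores: Dict[Window, float] = {}
    for r0 in range(0, rows, win):
        for c0 in range(0, cols, win):
            cells = [
                score_map[r][c]
                for r in range(r0, min(r0 + win, rows))
                for c in range(c0, min(c0 + win, cols))
            ]
            scores[(r0 // win, c0 // win)] = sum(cells) / len(cells)
    return scores


def select_sparse_windows(
    score_map: List[List[float]], win: int, keep_ratio: float
) -> List[Window]:
    """Keep the top `keep_ratio` fraction of windows by mean score (assumed rule)."""
    scores = window_scores(score_map, win)
    k = max(1, math.ceil(keep_ratio * len(scores)))
    # Sort by descending score; break ties by window index for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return sorted(ranked[:k])


# Toy 4x4 objectness map: most of the scene is empty, as in an HRW shot.
demo = [
    [0.0, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.7, 0.9],
    [0.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.1],
]
print(select_sparse_windows(demo, win=2, keep_ratio=0.25))  # [(0, 1)]
```

Only the selected windows would then receive fine-grained local attention, while the rest of the scene is covered by cheaper coarse features; how the actual model scores and fuses them is described in the paper rather than here.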
