SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer

Wenxi Li, Yuchen Guo, Jilai Zheng, Haozhe Lin, Chao Ma, Lu Fang, Xiaokang Yang

TL;DR

This work targets object detection in high-resolution wide (HRW) gigapixel scenes, where extreme sparsity and massive scale variation challenge traditional detectors. It introduces SparseFormer, a sparse Vision Transformer that uses ScoreNet to select informative regions and combines global attention on aggregated features with local attention on sparse windows. Cross-slice Non-Maximum Suppression (C-NMS) and a multi-scale training/inference pipeline address issues with oversized and tiny objects across slices and scales. Extensive experiments on PANDA and DOTA-v1.0 show significant improvements in both accuracy and efficiency, including notable FLOPs reductions while boosting AP, and strong performance on edge devices, demonstrating practical applicability for HRW-shot detection.
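As a rough illustration of the multi-scale slicing step mentioned above, the sketch below enumerates overlapping square slices of an image at several slice sizes. The function names, the 25% overlap ratio, and the slice sizes are assumptions made for this example; they are not taken from the paper.

```python
def slice_starts(length, size, overlap):
    """Start offsets of windows of `size` pixels that cover [0, length).

    Neighbouring windows share at least `overlap` pixels; the last window
    is placed flush with the border so that nothing is left uncovered.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than the slice size")
    if size >= length:
        return [0]
    stride = size - overlap
    starts = list(range(0, length - size, stride))
    starts.append(length - size)
    return starts


def multi_scale_slices(width, height, sizes, overlap_ratio=0.25):
    """Return (x1, y1, x2, y2) slice boxes for every slice size in `sizes`."""
    slices = []
    for size in sizes:
        overlap = int(size * overlap_ratio)
        for y in slice_starts(height, size, overlap):
            for x in slice_starts(width, size, overlap):
                # Clamp so a slice larger than the image stays inside it.
                slices.append((x, y, min(x + size, width), min(y + size, height)))
    return slices


# Hypothetical 10x10 "image" sliced at sizes 4 and 10.
boxes = multi_scale_slices(10, 10, sizes=[4, 10])
```

Small slices keep tiny objects at a usable resolution, while large slices give oversized objects a chance to fit into a single slice; the overlap is what later produces duplicate detections that must be merged across slices.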

Abstract

Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, which make existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap in object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows and a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.
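The idea of spending fine-grained computation only on windows that may contain objects can be sketched as a top-k selection over per-window scores. This is a minimal illustration under stated assumptions: the scores are given as plain numbers (in the paper they come from a learned ScoreNet, whose details are not reproduced here), and `select_windows`, `sparse_update`, `keep_ratio`, and `refine` are hypothetical names.

```python
def select_windows(scores, keep_ratio):
    """Indices of the highest-scoring windows, returned in ascending order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_update(features, scores, keep_ratio, refine):
    """Apply `refine` to the selected windows only.

    Unselected windows keep their original features, which is where the
    computational saving comes from.
    """
    selected = set(select_windows(scores, keep_ratio))
    return [refine(f) if i in selected else f for i, f in enumerate(features)]


# Hypothetical scores for six windows; keep the top half.
window_scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3]
window_features = [1, 2, 3, 4, 5, 6]
kept = select_windows(window_scores, keep_ratio=0.5)
updated = sparse_update(window_features, window_scores, 0.5, refine=lambda f: f * 10)
```

Here `refine` stands in for local attention; with a keep ratio of 0.5 it is called on three of the six windows.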

Paper Structure

This paper contains 15 sections, 9 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Performance comparison in terms of object size on the PANDA dataset [wang2020panda]. The horizontal axis indicates object size (the area of the bounding box) on a logarithmic scale. The vertical axis shows detection accuracy per size. Both YOLOv8 [yolov8] and DINO [zhang2022dino] underperform under extreme scale variation, especially for small and large objects. The proposed method performs well, achieving new state-of-the-art detection accuracy.
  • Figure 2: Featured detection example on PANDA. The state-of-the-art detectors YOLOv8 [yolov8] (blue) and DINO [zhang2022dino] (green), which rely on fixed settings of the receptive field and anchors, yield incomplete bounding boxes on a large bus and miss a small car.
  • Figure 3: Pipeline of SparseFormer in one forward inference. First, we perform multi-scale slicing on a gigapixel image. Then, we apply patch partitioning to each slice, and group neighboring patches into windows. Global Attention utilizes aggregated features to quickly obtain coarse-grained information. Local Attention selects important windows to extract fine-grained information.
  • Figure 4: Network architecture of SparseFormer. The red box represents the interaction range of attention. Markers in the figure distinguish tokens that are updated by self-attention from tokens that remain unchanged. We partition the image into tokens and group them into windows. Global attention extracts coarse-grained features from all windows based on aggregated tokens and merges them with the original features. Local attention uses our ScoreNet to select only the windows with complex details for fine-grained feature extraction, while the rest retain their original features to save computational resources.
  • Figure 5: Featured detection example on large objects with slicing aid. The detector yields two boxes from overlapping slices. NMS, which relies on detection scores, wrongly selects the blue box for the kid.
  • ...and 2 more figures
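The failure case described in the Figure 5 caption can be reproduced with ordinary score-based NMS. The sketch below is standard greedy NMS, not the paper's C-NMS, and the box coordinates and scores are made-up values chosen so that the incomplete box from one slice has the higher score.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Standard NMS: visit boxes by descending score, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections of the same object from two overlapping slices.
full_box = (100, 100, 200, 300)        # complete object, lower score
truncated_box = (100, 100, 200, 240)   # cut off at a slice border, higher score
kept = greedy_nms([full_box, truncated_box], scores=[0.80, 0.90])
```

With these values the two boxes overlap with an IoU of 0.7, so NMS keeps only index 1, the truncated box, and discards the complete one. Handling this kind of cross-slice duplicate is what motivates a slice-aware suppression rule.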