AirShot: Efficient Few-Shot Detection for Autonomous Exploration

Zihan Wang, Bowen Li, Chen Wang, Sebastian Scherer

TL;DR

AirShot introduces Top Prediction Filter (TPF), a lightweight module that exploits correlation maps to enable fast, no-finetuning few-shot detection suitable for autonomous exploration. By training TPF with global and local representations to predict class existence and deploying a pre-selection during inference, AirShot achieves substantial efficiency gains while maintaining or improving detection accuracy on COCO, VOC, and SubT datasets. The approach reduces exhaustive class loops to a minor-loop strategy with adaptive or Top-N pre-selection, enabling real-time performance on low-power robots. Across extensive experiments and ablations, AirShot demonstrates improved precision (up to 36.4% AP gains reported) and notable inference speedups (up to 56.3% faster) without requiring offline fine-tuning, making it practically impactful for robotic perception in unseen environments.

Abstract

Few-shot object detection has drawn increasing attention in the field of robotic exploration, where robots are required to find unseen objects from a few examples provided online. Although recent efforts have been made to yield online processing capabilities, the slow inference speeds of low-powered robots fail to meet the demands of real-time detection, making these methods impractical for autonomous exploration. Existing methods still face performance and efficiency challenges, mainly due to unreliable features and exhaustive class loops. In this work, we propose a new paradigm, AirShot, and find that, by fully exploiting the valuable correlation map, AirShot yields a more robust and faster few-shot object detection system that is better suited to the robotics community. The core module, Top Prediction Filter (TPF), operates on multi-scale correlation maps in both the training and inference stages. During training, TPF supervises the generation of a more representative correlation map; during inference, it reduces looping iterations by selecting top-ranked classes, cutting computational costs while improving performance. Surprisingly, this dual functionality exhibits general effectiveness and efficiency on various off-the-shelf models. Exhaustive experiments on the COCO2017, VOC2014, and SubT datasets demonstrate that TPF can significantly boost the efficacy and efficiency of most off-the-shelf models, achieving up to 36.4% precision improvements along with 56.3% faster inference speed. Code and data are available at: https://github.com/ImNotPrepared/AirShot.
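The difference between an exhaustive class loop and a pre-selected "minor loop" can be illustrated with a small sketch. This is not the authors' implementation: `preselect_classes`, `detect_minor_loop`, the per-class score dictionary, and the `detect_one_class` callback are hypothetical names introduced here, and treating the adaptive mode as a fixed score threshold is an assumption, since the abstract does not define it.

```python
from typing import Callable, Dict, List, Optional


def preselect_classes(
    scores: Dict[str, float],
    top_n: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[str]:
    """Pick the classes worth running the detection head on.

    scores    : hypothetical per-class existence scores in [0, 1]
    top_n     : keep only the N highest-scoring classes (Top-N mode)
    threshold : keep only classes scoring at least this value
                (assumed stand-in for the adaptive mode)
    """
    # Rank classes from most to least likely to be present.
    ranked = sorted(scores, key=scores.get, reverse=True)
    if threshold is not None:
        ranked = [c for c in ranked if scores[c] >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


def detect_minor_loop(
    image: object,
    scores: Dict[str, float],
    detect_one_class: Callable[[object, str], list],
    top_n: Optional[int] = None,
    threshold: Optional[float] = None,
) -> list:
    """Run the per-class detector only on the pre-selected classes."""
    detections = []
    for cls in preselect_classes(scores, top_n, threshold):
        detections.extend(detect_one_class(image, cls))
    return detections


if __name__ == "__main__":
    # Made-up scores for four support classes.
    example_scores = {"helmet": 0.91, "drill": 0.07, "rope": 0.64, "backpack": 0.12}
    print(preselect_classes(example_scores, top_n=2))
    print(preselect_classes(example_scores, threshold=0.1))
```

With four support classes and `top_n=2`, the per-class detector runs twice instead of four times. That is the kind of saving the minor-loop strategy targets, although the actual gain depends on how expensive the score prediction itself is.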

Paper Structure (31 sections, 5 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Application sketch of AirShot. During training, we use TPF to increase the representation capability of correlation maps. When directly applied to Robot Explorer, TPF conducts pre-selection to enable minor loop inference instead of traditional full loops.
  • Figure 2: Working pipeline of AirDet. AirDet includes 3 modules, i.e., the shared backbone, feature fusion module for region proposal and shots aggregation, plus relation-based detection head.
  • Figure 3: Detailed working illustration of AirShot in the training and inference stages. We adopt the backbone design of AirDet, which contains a backbone feature extractor, SCS for feature fusion, relation-based shots aggregation, and location regression.
  • Figure 4: Network architecture of TPF. We design two branches for global representation and local representation separately. Then the concatenated representation will be fed into a 3-layer MLP.
  • Figure 5: Ablation of TPF module regarding OR and efficiency (K=3)
  • ...and 4 more figures
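
The Figure 4 caption describes TPF as two branches (global and local) whose concatenated output feeds a 3-layer MLP. The sketch below shows only that data flow, in plain Python. Everything beyond the caption is an assumption made for illustration: the global branch is taken to be a mean over the correlation map, the local branch a per-row maximum, the map is single-channel and single-scale, the weights are random rather than trained, and all function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]  # (weights, bias)


def global_branch(corr_map: Matrix) -> List[float]:
    """Assumed global descriptor: mean response over the whole map."""
    cells = [v for row in corr_map for v in row]
    return [sum(cells) / len(cells)]


def local_branch(corr_map: Matrix) -> List[float]:
    """Assumed local descriptor: strongest response in each row."""
    return [max(row) for row in corr_map]


def linear(x: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng: random.Random, n_in: int, n_out: int) -> Layer:
    """Random (untrained) weights, for demonstration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def tpf_score(corr_map: Matrix, layers: List[Layer]) -> float:
    """Concatenate both branches, apply the MLP, return a score in (0, 1)."""
    hidden = global_branch(corr_map) + local_branch(corr_map)
    for weights, bias in layers[:-1]:
        hidden = [max(0.0, v) for v in linear(hidden, weights, bias)]  # ReLU
    weights, bias = layers[-1]
    logit = linear(hidden, weights, bias)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


if __name__ == "__main__":
    rng = random.Random(0)
    corr = [[0.1, 0.9, 0.2], [0.0, 0.3, 0.1], [0.4, 0.2, 0.8]]
    n_features = 1 + len(corr)  # one global value plus one local value per row
    mlp = [init_layer(rng, n_features, 8), init_layer(rng, 8, 8), init_layer(rng, 8, 1)]
    print(tpf_score(corr, mlp))
```

In a trained model, one such score per support class could serve as the class-existence prediction that the pre-selection step ranks.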