HMPE: HeatMap Embedding for Efficient Transformer-Based Small Object Detection
YangChen Zeng
TL;DR
This work tackles small object detection in Transformer-based pipelines by introducing HeatMap Position Embedding (HMPE), a mechanism that dynamically couples positional encodings with detection semantics through heatmaps. HMPE is visualized to aid parameter tuning and is integrated into the encoder and decoder through the MOHFE and HIDQ modules, which generate high-quality queries and suppress background-noise queries, while LSConv strengthens feature extraction for sparse, small targets. On NWPU VHR-10 and PASCAL VOC, the approach improves mAP over the baseline by 1.9% and 1.2%, respectively, and also improves mAP@0.95, while cutting the decoder from eight layers to as few as three and lowering training and inference costs. Overall, HMPE offers a practical path to more accurate and efficient small-object detection and suggests applying heatmap-guided embeddings to broader vision tasks.
Abstract
Transformer-based methods for small object detection continue to emerge, yet they still exhibit significant shortcomings. This paper introduces HeatMap Position Embedding (HMPE), a novel Transformer optimization technique that enhances object detection performance by dynamically integrating positional encoding with semantic detection information through heatmap-guided adaptive learning. We also visualize HMPE, which makes the embedded information easy to inspect during parameter fine-tuning. We then create the Multi-Scale ObjectBox-Heatmap Fusion Encoder (MOHFE) and HeatMap Induced High-Quality Queries for Decoder (HIDQ) modules. These are designed for the encoder and decoder, respectively, to generate high-quality queries and reduce background-noise queries. Using both heatmap embedding and Linear-Snake Conv (LSConv) feature engineering, we enhance the embedding of massively diverse small object categories and reduce the number of decoder multi-head layers, thereby accelerating both inference and training. In the generalization experiments, our approach outperforms the baseline by 1.9% mAP on the small object dataset (NWPU VHR-10) and by 1.2% mAP on the general dataset (PASCAL VOC). By employing HMPE-enhanced embedding, we are able to reduce the number of decoder layers from eight to a minimum of three, significantly decreasing both inference and training costs.
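The idea of using a heatmap to keep high-quality queries and drop background-noise queries can be illustrated with a simple selection step. The sketch below is an assumption made for illustration only, not the paper's HIDQ module: the function name `select_queries`, the parameters `k` and `threshold`, and the toy heatmap values are all hypothetical. It treats low-scoring locations as background and keeps the top-scoring remainder as query positions.

```python
from typing import List, Tuple


def select_queries(
    heatmap: List[List[float]], k: int, threshold: float = 0.0
) -> List[Tuple[int, int, float]]:
    """Return up to k (row, col, score) locations with the highest scores.

    Hypothetical helper: locations scoring below `threshold` are treated as
    background and dropped before the top-k cut.
    """
    candidates = [
        (r, c, score)
        for r, row in enumerate(heatmap)
        for c, score in enumerate(row)
        if score >= threshold
    ]
    # Highest score first; ties are broken by (row, col) for determinism.
    candidates.sort(key=lambda t: (-t[2], t[0], t[1]))
    return candidates[:k]


# Toy 2x3 heatmap (made-up values): three locations pass the threshold,
# and the two strongest are kept as query positions.
heatmap = [
    [0.10, 0.90, 0.05],
    [0.40, 0.02, 0.70],
]
queries = select_queries(heatmap, k=2, threshold=0.2)
print(queries)
```

In this toy setting, fewer and cleaner queries reach the decoder, which is one plausible reason a heatmap-guided design could tolerate a shallower decoder; the actual mechanism is whatever the paper's modules implement.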
