HMPE: HeatMap Embedding for Efficient Transformer-Based Small Object Detection

YangChen Zeng

TL;DR

This work tackles small object detection in Transformer-based pipelines by introducing HeatMap Position Embedding (HMPE), a mechanism that dynamically couples positional encodings with detection semantics through heatmaps. HMPE can be visualized to guide parameter tuning, and it is integrated into the encoder and decoder through the MOHFE and HIDQ modules, which generate high-quality queries and suppress background-noise queries, while LSConv strengthens feature extraction for sparse, small targets. Empirical results on NWPU VHR-10 and PASCAL VOC show improvements in mAP (1.9% and 1.2% over the baseline, respectively) and mAP@0.95, and the decoder shrinks from eight layers to as few as three, lowering training and inference costs. Overall, HMPE offers a practical, scalable pathway to more accurate and efficient small-object detection and suggests avenues for applying heatmap-guided embeddings to broader vision tasks.

Abstract

Transformer-based methods for small object detection continue to emerge, yet they still exhibit significant shortcomings. This paper introduces HeatMap Position Embedding (HMPE), a novel Transformer optimization technique that enhances object detection performance by dynamically integrating positional encoding with semantic detection information through heatmap-guided adaptive learning. We also visualize the HMPE method, offering a clear view of the embedded information for parameter fine-tuning. We then create the Multi-Scale ObjectBox-Heatmap Fusion Encoder (MOHFE) and HeatMap Induced High-Quality Queries for Decoder (HIDQ) modules. These are designed for the encoder and decoder, respectively, to generate high-quality queries and reduce background noise queries. Using both heatmap embedding and Linear-Snake Conv (LSConv) feature engineering, we enhance the embedding of massively diverse small object categories and reduce the decoder multihead layers, thereby accelerating both inference and training. In the generalization experiments, our approach outperforms the baseline mAP by 1.9% on the small object dataset (NWPU VHR-10) and by 1.2% on the general dataset (PASCAL VOC). By employing HMPE-enhanced embedding, we are able to reduce the number of decoder layers from eight to a minimum of three, significantly decreasing both inference and training costs.
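
The idea of letting a heatmap decide which positions become decoder queries can be illustrated with a small sketch. This is not the authors' implementation: the function names, the min-max normalization, the nearest-neighbour upsampling, the fixed threshold, and the choice to scale a standard sinusoidal embedding by the heat value are all assumptions made purely for illustration. It only shows the general pattern of normalizing a heatmap, upsampling it, masking out low-heat (background) positions, and attaching a heat-weighted position embedding to the survivors.

```python
import math


def min_max_normalize(heatmap):
    """Scale a 2D list of scores to the range [0, 1]."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / span for v in row] for row in heatmap]


def upsample_nearest(heatmap, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    out = []
    for row in heatmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def sinusoidal_embedding(pos, dim):
    """Standard 1D sinusoidal position embedding of length dim."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def select_queries(heatmap, threshold, dim=8):
    """Keep positions with heat >= threshold (hypothetical mask filter).

    Each kept position gets a sinusoidal embedding scaled by its heat,
    which is an assumed way of coupling position with detection score.
    """
    width = len(heatmap[0])
    queries = []
    for y, row in enumerate(heatmap):
        for x, heat in enumerate(row):
            if heat >= threshold:
                emb = sinusoidal_embedding(y * width + x, dim)
                queries.append(((y, x), [heat * e for e in emb]))
    return queries


# Toy 2x2 heatmap with one "hot" cell in the bottom-left corner.
coarse = [[0.1, 0.2], [0.9, 0.3]]
heat = upsample_nearest(min_max_normalize(coarse), factor=2)
queries = select_queries(heat, threshold=0.5)
positions = [pos for pos, _ in queries]
print(positions)
```

In this toy case only the four upsampled cells that came from the hot corner survive the mask, so the decoder would receive four queries instead of sixteen. Fewer, better-placed queries is the intuition behind the reduced decoder depth described in the abstract.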

Paper Structure

This paper contains 18 sections, 11 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Heatmap embedding visualized with a heatbar. The heatmap embedding shows a "hot" middle and "cold" ends, corresponding to "good" and "poor" embeddings.
  • Figure 2: The pipeline of the proposed HeatMap Embedding method. It consists of the HIDQ module, the MOHFE module, and Linear-Snake. The HMPE module applies normalization, followed by upsampling, and employs a Mask Filter to produce high-quality queries while filtering out low-quality background noise.
  • Figure 3: Transformer Optimization with HMPE.
  • Figure 4: Dual-path complementary structure in a high-dimensional structure. The green lines represent the feature extraction path for the snake section, while the blue lines denote the path for the linear section. The red square denotes the center of convolution. The left image illustrates the efficient extraction of complex features under the dual-path complementary structure, and the right image simulates the extraction processes along the x-axis and y-axis, respectively.
  • Figure 5: Comparison of traditional convolution, dilated convolution, deformable convolution [zhu2019deformable], DSC convolution [qi2023dynamic], and two variants of LSConv operating on a Linear-Snake over a 9x9 grid. Blue represents the simulated path of LSConv, while green indicates the simulated path of the other convolutions.
  • ...and 1 more figure