Improved Dense Nested Attention Network Based on Transformer for Infrared Small Target Detection

Chun Bao, Jie Cao, Yaqian Ning, Tianhua Zhao, Zhijun Li, Zechen Wang, Li Zhang, Qun Hao

TL;DR

The paper tackles infrared small target detection in cluttered scenes, where target features weaken as CNN depth increases. It develops IDNANet, a transformer-based enhancement of DNANet that uses Swin-T v2 for feature extraction, ACmix attention between layers, and a weighted dice binary cross-entropy (WD-BCE) loss to address foreground-background imbalance. It also introduces the BIT-SIRST dataset, which combines real and synthetic data with contour ground truth to improve generalization. On public datasets and BIT-SIRST, IDNANet achieves state-of-the-art detection accuracy and false-alarm control, supporting both the method and the dataset for practical infrared surveillance tasks.

Abstract

Infrared small target detection based on deep learning offers unique advantages in separating small targets from complex and dynamic backgrounds. However, the features of infrared small targets gradually weaken as the depth of a convolutional neural network (CNN) increases. To address this issue, we propose a novel method for detecting infrared small targets called the improved dense nested attention network (IDNANet), which is based on the transformer architecture. We preserve the dense nested structure of the dense nested attention network (DNANet) and introduce the Swin-transformer during the feature extraction stage to enhance the continuity of features. Furthermore, we integrate the ACmix attention structure into the dense nested structure to enhance the features of intermediate layers. Additionally, we design a weighted dice binary cross-entropy (WD-BCE) loss function to mitigate the negative impact of foreground-background imbalance in the samples. Moreover, we develop a dataset specifically for infrared small targets, called BIT-SIRST. The dataset comprises a large number of real-world targets with manually annotated labels, as well as synthetic data with corresponding labels. We have evaluated the effectiveness of our method through experiments conducted on public datasets. Our approach outperforms other state-of-the-art methods in terms of probability of detection ($P_d$), false-alarm rate ($F_a$), and mean intersection over union ($mIoU$). The $mIoU$ reaches 90.89% on the NUDT-SIRST dataset and 79.72% on the SIRST dataset. The BIT-SIRST dataset and code are openly available at https://github.com/EdwardBao1006/bit_sirst.
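
To make the idea of combining a Dice term with a binary cross-entropy term concrete, here is a minimal sketch of a weighted Dice plus BCE loss in plain Python. It is not the paper's WD-BCE definition, which this summary does not state. The function name `wd_bce_loss` and the parameters `w_dice`, `w_bce`, and `pos_weight` are hypothetical names chosen for illustration, and the fixed weights are an assumption.

```python
import math


def wd_bce_loss(probs, targets, w_dice=0.5, w_bce=0.5, pos_weight=1.0, eps=1e-7):
    """Illustrative weighted sum of a soft Dice loss and a BCE loss.

    probs   -- predicted foreground probabilities in [0, 1], flattened
    targets -- binary ground-truth labels (0 or 1), flattened
    The weights w_dice, w_bce and pos_weight are assumptions, not values
    taken from the paper.
    """
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and equal in length")

    # Clamp probabilities so that log() never sees exactly 0 or 1.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]

    # Binary cross-entropy, averaged over pixels. pos_weight > 1 up-weights
    # the rare foreground (target) pixels.
    bce = -sum(
        pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
        for p, t in zip(clipped, targets)
    ) / len(probs)

    # Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|). It depends on overlap rather
    # than on pixel counts, so a large background does not dominate it.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)

    return w_dice * dice + w_bce * bce


# A mask with one target pixel out of four.
mask = [0, 0, 1, 0]
perfect = wd_bce_loss([0.0, 0.0, 1.0, 0.0], mask)
all_background = wd_bce_loss([0.0, 0.0, 0.0, 0.0], mask)
print(perfect < all_background)
```

With very small targets, a prediction that labels everything as background makes few per-pixel errors, so the averaged BCE term alone can stay small. The Dice term stays close to 1 in that case, which is the usual motivation for combining the two.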

Paper Structure (28 sections, 10 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Overview of the IDNANet architecture. The network consists of four components: input image, U-shape feature extraction, head, and prediction. The input infrared image is ${\bf{X}} \in {{\mathbb{R}}^{3\times H \times W}}$, where $H$ and $W$ are the height and width of the image, respectively, and ${{\bf{F}}^{(i,j)}}$ denotes the feature map of the U-shape network at position $(i, j)$.
  • Figure 2: The structure of the Swin-T block. Because CNN-based methods suffer from feature dissipation in the feature extraction stage, we use Swin-T v2 as the backbone network in this study. Specifically, we apply patch embedding and position embedding to infrared images containing small targets. The processed image is then fed into the standard transformer encoder and decoder.
  • Figure 3: The architecture of the ACmix block. The block takes the feature maps of the components adjacent to each node as input. This approach not only facilitates cross-layer information exchange between feature maps but also enhances the features extracted by the Swin-T block.
  • Figure 4: The structure of the feature pyramid fusion head. The head takes saliency maps at four different scales as input. Unlike traditional eight-neighborhood clustering segmentation, we directly employ a loss function to impose constraints and achieve end-to-end segmentation.
  • Figure 5: Representative infrared images from the BIT-SIRST dataset with various backgrounds. To enhance visibility, the demarcated area is enlarged, making it easier to see when zoomed in on a computer screen. The collected infrared small target images are numbered (1)-(20).
  • ...and 3 more figures