Deformable Feature Alignment and Refinement for Moving Infrared Dim-small Target Detection

Dengyan Luo, Yanping Xiang, Hu Wang, Luping Ji, Shuai Li, Mao Ye

TL;DR

DFAR addresses explicit motion compensation for moving infrared dim-small target detection by integrating a Temporal Deformable Alignment (TDA) module built on Dilated Convolution Attention Fusion (DCAF) blocks and a Feature Refinement (FR) module with Attention-guided Deformable Fusion (AGDF), plus a Motion Compensation loss to supervise temporal alignment. The approach enables explicit alignment of adjacent frames at the feature level during both training and inference, capturing complex motion with a two-stage deformable mechanism and adaptive fusion. On the DAUB and IRDST datasets, DFAR achieves state-of-the-art results in $mAP_{50}$, precision, recall, and $F1$, outperforming SSTNet and other baselines while offering improved inference efficiency. This work demonstrates that deformable, attention-guided motion modeling can significantly enhance detection of moving infrared dim-small targets in challenging scenes.

Abstract

The detection of moving infrared dim-small targets has been a challenging and prevalent research topic. Current state-of-the-art methods mainly rely on ConvLSTM to aggregate information from adjacent frames to facilitate detection in the current frame. However, these methods use motion information only implicitly and only in the training stage, and they fail to explicitly explore motion compensation, resulting in poor performance on video sequences with large motion. In this paper, we propose a Deformable Feature Alignment and Refinement (DFAR) method based on deformable convolution to explicitly use motion context in both the training and inference stages. Specifically, a Temporal Deformable Alignment (TDA) module based on the designed Dilated Convolution Attention Fusion (DCAF) block is developed to explicitly align the adjacent frames with the current frame at the feature level. Then, the feature refinement module adaptively fuses the aligned features and further aggregates useful spatio-temporal information by means of the proposed Attention-guided Deformable Fusion (AGDF) block. In addition, to improve the alignment of adjacent frames with the current frame, we extend the traditional loss function by introducing a new motion compensation loss. Extensive experimental results demonstrate that the proposed DFAR method achieves state-of-the-art performance on two benchmark datasets, DAUB and IRDST.

Paper Structure (21 sections, 24 equations, 12 figures, 4 tables)


Figures (12)

  • Figure 1: Illustration of the differences between our method and the state-of-the-art method SSTNet [chen2024sstnet]. The LSTM-based SSTNet implicitly aggregates the information from adjacent frames only in the training stage, while our method utilizes deformable convolution (DCN) to perform explicit inter-frame feature alignment in both the training and inference stages, and applies the introduced Motion Compensation (MC) loss to supervise the temporal alignment.
  • Figure 2: The framework of the proposed DFAR approach consists of four parts. (a) A feature extraction module is applied to extract the spatial information from the input clip $I_{[t-R,t+R]}$ and obtain the extracted features $F_{[t-R,t+R]}^{E}$. (b) The extracted visual features $F_{i}^{E}$, with $i \in [t-R,t+R]$ and $i \neq t$, are each concatenated with $F_{t}^{E}$ along the channel dimension and fed into the Temporal Deformable Alignment (TDA) module, which is based on the Dilated Convolution Attention Fusion (DCAF) blocks, for alignment. (c) The aligned features and the extracted target feature are input into the feature refinement module, which is based on the attention weight block and the designed Attention-guided Deformable Fusion (AGDF) blocks, to adaptively fuse and refine spatio-temporal information. (d) The refined feature $F_D$ is fed into the detection head module for calculating the detection loss. The network is optimized under the supervision of the traditional detection loss and the introduced Motion Compensation (MC) loss. Here, the temporal radius is $R = 2$.
  • Figure 3: An example of a moving infrared dim-small target. The target is more easily perceived in the adjacent frame 806 than in the current frame 807, and is more difficult to detect in the adjacent frame 808.
  • Figure 4: The structure of the Dilated Convolution Attention Fusion (DCAF) block. It contains 4 dilated convolutions with dilation rates from 1 to 4. Unless otherwise indicated, the kernel size and dilation rate of each convolutional layer are set to $3 \times 3$ and 1, respectively.
  • Figure 5: An example of visualizing feature maps. The feature maps are obtained by averaging all corresponding channel features. The white area is the anchor area to be aligned, and the red areas indicate the target areas. After alignment, the target features in adjacent frames are closer to the target features in the detected frame and become more easily perceived and utilized. Zoom in for the best view.
  • ...and 7 more figures
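
The captions of Figures 2 and 4 fix two quantities that are easy to check by hand: which adjacent frames are aligned to the current frame for temporal radius $R = 2$, and how much spatial context each of the four dilated $3 \times 3$ convolutions in a DCAF block covers. The following is a minimal sketch of that arithmetic only, not the authors' implementation. The function names are hypothetical, the frame index 807 is borrowed from Figure 3 purely for illustration, and the padding value assumes stride 1 with the spatial size preserved, which the captions do not state.

```python
def adjacent_indices(t, radius=2):
    """Frame indices i in [t - radius, t + radius] with i != t (Figure 2)."""
    return [i for i in range(t - radius, t + radius + 1) if i != t]


def effective_kernel(kernel=3, dilation=1):
    """Spatial extent spanned by one dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def same_padding(kernel=3, dilation=1):
    """Zero padding that keeps the spatial size at stride 1 (odd kernels).

    Assumption: the captions do not say how the branches are padded.
    """
    return dilation * (kernel - 1) // 2


# Adjacent frames that would be aligned to the current frame t = 807.
neighbours = adjacent_indices(t=807)

# (dilation, effective kernel size, padding) for dilation rates 1 to 4.
branches = [(d, effective_kernel(3, d), same_padding(3, d)) for d in range(1, 5)]

print(neighbours)
print(branches)
```

Under these assumptions, the four branches span 3, 5, 7, and 9 pixels per side while each still uses only nine weights per channel pair, which is one common reason for pairing several dilation rates when the displacement between frames is not known in advance.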