SCTransNet: Spatial-channel Cross Transformer Network for Infrared Small Target Detection

Shuai Yuan, Hanlin Qin, Xiang Yan, Naveed Akhtar, Ajmal Mian

TL;DR

SCTransNet addresses infrared small target detection by introducing Spatial-channel Cross Transformer Blocks (SCTB) that fuse multi-level encoder features through SSCA and CFN, placed on long-range skip connections. The SSCA module exchanges local spatial cues with global channel information, while CFN provides multi-scale spatial-channel enhancement to bridge encoder–decoder gaps. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1k report higher IoU, nIoU, and F-measure than existing methods, with robust ROC behavior and fewer false alarms, supported by ablations showing the contributions of SCTB, SSCA, CFN, and CCA. The work offers a transformer-based framework for IRSTD, and the authors state that the code will be released for reproducibility.

Abstract

Infrared small target detection (IRSTD) has recently benefitted greatly from U-shaped neural models. However, largely overlooking effective global information modeling, existing techniques struggle when the target has high similarities with the background. We present a Spatial-channel Cross Transformer Network (SCTransNet) that leverages spatial-channel cross transformer blocks (SCTBs) on top of long-range skip connections to address the aforementioned challenge. In the proposed SCTBs, the outputs of all encoders are interacted with cross transformer to generate mixed features, which are redistributed to all decoders to effectively reinforce semantic differences between the target and clutter at full scales. Specifically, SCTB contains the following two key elements: (a) spatial-embedded single-head channel-cross attention (SSCA) for exchanging local spatial features and full-level global channel information to eliminate ambiguity among the encoders and facilitate high-level semantic associations of the images, and (b) a complementary feed-forward network (CFN) for enhancing the feature discriminability via a multi-scale strategy and cross-spatial-channel information interaction to promote beneficial information transfer. Our SCTransNet effectively encodes the semantic differences between targets and backgrounds to boost its internal representation for detecting small infrared targets accurately. Extensive experiments on three public datasets, NUDT-SIRST, NUAA-SIRST, and IRSTD-1k, demonstrate that the proposed SCTransNet outperforms existing IRSTD methods. Our code will be made public at https://github.com/xdFai.
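
To make the channel-cross idea in the abstract concrete, the following is a minimal NumPy sketch of single-head channel-cross attention: queries come from one encoder level, while keys and values come from the channel-wise concatenation of all levels. It is an illustration under stated assumptions, not the authors' SSCA. The function name `channel_cross_attention`, the random projection weights, the scaling by the token count, and the requirement that all levels share the same number of tokens are all hypothetical choices, and the spatial embedding that SSCA adds is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_cross_attention(level_tokens, query_level, rng):
    """Single-head channel-cross attention (illustrative sketch).

    level_tokens: list of arrays, each of shape (N, C_i), i.e. N spatial
                  tokens with C_i channels. Assumption: every level has
                  already been resampled to the same N.
    query_level:  index of the encoder level that supplies the queries.
    rng:          NumPy Generator; random matrices stand in for learned weights.
    """
    q_in = level_tokens[query_level]               # (N, C_q)
    kv_in = np.concatenate(level_tokens, axis=1)   # (N, C_sum), all levels
    c_q, c_sum = q_in.shape[1], kv_in.shape[1]

    # Hypothetical projections (learned in a real model).
    w_q = rng.standard_normal((c_q, c_q)) / np.sqrt(c_q)
    w_k = rng.standard_normal((c_sum, c_sum)) / np.sqrt(c_sum)
    w_v = rng.standard_normal((c_sum, c_sum)) / np.sqrt(c_sum)

    q = q_in @ w_q    # (N, C_q)
    k = kv_in @ w_k   # (N, C_sum)
    v = kv_in @ w_v   # (N, C_sum)

    # Attention is computed between channels, not between spatial positions:
    # each query channel attends over every channel of every level.
    scores = q.T @ k / np.sqrt(q.shape[0])   # (C_q, C_sum); scaling is an assumption
    attn = softmax(scores, axis=-1)

    mixed = v @ attn.T                       # (N, C_q)
    return q_in + mixed, attn                # residual connection

# Example: three encoder levels with 4, 8, and 16 channels over 16 tokens.
rng = np.random.default_rng(0)
tokens = [rng.standard_normal((16, c)) for c in (4, 8, 16)]
out, attn = channel_cross_attention(tokens, query_level=0, rng=rng)
```

Because the attention map has shape (C_q, C_sum) rather than (N, N), its cost grows with the number of channels instead of the number of pixels, which is one reason channel-wise attention is attractive for dense prediction on full-resolution feature maps.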

Paper Structure

This paper contains 23 sections, 14 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: The framework and visualization maps of representative IRSTD methods, with each method's frame labeled according to the specific challenge it addresses. The visualization maps show that the CNN-based approaches (ACM [7], DNA-Net [8], and UIU-Net [10]) focus on modeling the local information of the target and less on establishing the global semantic information of the image; the hybrid CNN-transformer methods (MTU-Net [11] and our SCTransNet) pay more attention to the global background information and target semantics. Only our method models buildings and the sky separately in the high-level semantic map, which distinguishes the target from the background and reduces false alarms.
  • Figure 2: Overview of the proposed SCTransNet for infrared small object detection. Our SCTransNet adopts a U-shaped structure and adds four spatial-channel cross transformer blocks (SCTB) on the long-range skip connections, and the multi-scale deeply supervised fusion strategy is used to optimize our SCTransNet.
  • Figure 3: The proposed spatial-channel cross transformer block (SCTB), which consists of spatial-embedded single-head channel-cross attention (SSCA) and complementary feed-forward network (CFN). (a) SSCA establishes image full-scale information association by means of different levels of semantic interaction. (b) CFN bridges the semantic gap between encoder and decoder through complementary feature enhancement.
  • Figure 4: Information enhancement from different perspectives: (a) the local spatial and global channel (LSGC) paradigm; (b) the global spatial and local channel (GSLC) paradigm. Our CFN integrates both of these information enhancement methods internally.
  • Figure 5: ROC curves of different methods on the NUAA-SIRST, NUDT-SIRST, and IRSTD-1k datasets. Our SCTransNet achieves the highest $P_d$ at very low $F_a$.
  • ...and 7 more figures
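
The IoU and nIoU metrics named in the TL;DR, and the false-alarm rate $F_a$ on the ROC axis of Figure 5, can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: the function names are hypothetical, IoU is assumed to be accumulated over the whole dataset while nIoU is assumed to be the mean of per-image IoU, images with an empty union are skipped in nIoU, and $F_a$ is computed at pixel level as false-positive pixels over all pixels. The detection probability $P_d$ is left out because it requires target-level matching.

```python
import numpy as np

def iou_and_niou(preds, masks):
    """Return (IoU, nIoU) for paired binary predictions and ground-truth masks.

    IoU  : total intersection / total union over the dataset (assumption).
    nIoU : mean of per-image IoU, skipping images whose union is empty (assumption).
    """
    inter_total, union_total = 0, 0
    per_image = []
    for pred, mask in zip(preds, masks):
        pred = np.asarray(pred, dtype=bool)
        mask = np.asarray(mask, dtype=bool)
        inter = int(np.logical_and(pred, mask).sum())
        union = int(np.logical_or(pred, mask).sum())
        inter_total += inter
        union_total += union
        if union > 0:
            per_image.append(inter / union)
    iou = inter_total / union_total if union_total > 0 else 0.0
    niou = float(np.mean(per_image)) if per_image else 0.0
    return iou, niou

def pixel_false_alarm_rate(preds, masks):
    """Simplified pixel-level F_a: false-positive pixels / all pixels."""
    false_pixels, all_pixels = 0, 0
    for pred, mask in zip(preds, masks):
        pred = np.asarray(pred, dtype=bool)
        mask = np.asarray(mask, dtype=bool)
        false_pixels += int(np.logical_and(pred, ~mask).sum())
        all_pixels += pred.size
    return false_pixels / all_pixels if all_pixels > 0 else 0.0

# Toy example with two 2x2 images.
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 1]]]
masks = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
iou, niou = iou_and_niou(preds, masks)
fa = pixel_false_alarm_rate(preds, masks)
```

The two IoU variants differ in how they weight images: the dataset-level ratio is dominated by images with larger targets, while the per-image mean gives each image equal weight, which matters when targets span only a few pixels.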