It's Not the Target, It's the Background: Rethinking Infrared Small Target Detection via Deep Patch-Free Low-Rank Representations

Guoyi Zhang, Guangsheng Xu, Siyang Chen, Han Wang, Xiaohu Zhang

TL;DR

This work tackles infrared small target detection in cluttered scenes, where low signal-to-clutter ratios (SCR) and target variability make detection difficult. It introduces LRRNet, a patch-free network that directly learns image-domain low-rank background priors through a compression–reconstruction–subtraction framework, preserving small-target cues while avoiding patch-based RPCA. The model employs a densely connected encoder–decoder with a single low-resolution self-attention module and subtraction-based target isolation, and is optimized with segmentation and reconstruction losses. Experiments on IRSTD-1K, SIRSTAUG, and NUDT-SIRST show state-of-the-art or competitive accuracy at real-time speed (82.34 FPS on average) and robustness to sensor noise, highlighting the practical potential of interpretable low-rank priors for infrared small target detection.

Abstract

This is the pre-acceptance version; the final version is available in IEEE Transactions on Geoscience and Remote Sensing on IEEE Xplore (https://ieeexplore.ieee.org/document/11156113). Infrared small target detection (IRSTD) remains a long-standing challenge in complex backgrounds due to low signal-to-clutter ratios (SCR), diverse target morphologies, and the absence of distinctive visual cues. While recent deep learning approaches aim to learn discriminative representations, the intrinsic variability and weak priors of small targets often lead to unstable performance. In this paper, we propose a novel end-to-end IRSTD framework, termed LRRNet, which leverages the low-rank property of infrared image backgrounds. Inspired by the physical compressibility of cluttered scenes, our approach adopts a compression–reconstruction–subtraction (CRS) paradigm to directly model structure-aware low-rank background representations in the image domain, without relying on patch-based processing or explicit matrix decomposition. To the best of our knowledge, this is the first work to directly learn low-rank background structures using deep neural networks in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate that LRRNet outperforms 38 state-of-the-art methods in terms of detection accuracy, robustness, and computational efficiency. Remarkably, it achieves real-time performance with an average speed of 82.34 FPS. Evaluations on the challenging NoisySIRST dataset further confirm the model's resilience to sensor noise. The source code will be made publicly available upon acceptance.
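
The compression–reconstruction–subtraction idea described in the abstract can be illustrated with a deliberately simple, hand-crafted stand-in. This is a sketch under strong assumptions, not LRRNet itself: average pooling plays the role of the learned compressor, nearest-neighbour upsampling plays the role of the learned reconstructor, and the function names (`compress`, `reconstruct`, `subtract`) and the toy 8x8 image are hypothetical. The real model learns both steps with a neural network and operates on features rather than raw pixels.

```python
def compress(img, factor):
    """Average-pool by `factor`: a crude stand-in for the learned compression step."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[r + a][c + b] for a in range(factor) for b in range(factor))
            / factor ** 2
            for c in range(0, w, factor)
        ]
        for r in range(0, h, factor)
    ]


def reconstruct(small, factor):
    """Nearest-neighbour upsampling back to full resolution (assumed reconstructor)."""
    return [
        [small[r // factor][c // factor] for c in range(len(small[0]) * factor)]
        for r in range(len(small) * factor)
    ]


def subtract(img, background):
    """Full-resolution subtraction: whatever the background estimate misses remains."""
    return [[x - b for x, b in zip(ri, rb)] for ri, rb in zip(img, background)]


# Toy scene (assumption): flat background of intensity 10 with one bright pixel.
image = [[10.0] * 8 for _ in range(8)]
image[3][5] = 50.0

background = reconstruct(compress(image, 4), 4)
residual = subtract(image, background)

# Location of the strongest residual response.
peak = max((v, (r, c)) for r, row in enumerate(residual) for c, v in enumerate(row))
print("peak residual", peak[0], "at", peak[1])
```

Because the compressed representation cannot carry a single-pixel detail, the reconstructed background absorbs very little of the target, and the subtraction leaves it as the dominant residual. The same reasoning motivates performing the subtraction at high resolution.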

Paper Structure

This paper contains 31 sections, 14 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison of the proposed LRRNet with existing data-driven methods. Unlike prior models that focus solely on target feature learning, LRRNet integrates image-domain low-rank priors without patch decomposition, enabling interpretable and efficient small-target detection. (a) Methods such as DNANet [DNANet], UIUNet [UIUNet], and SeRankDet [SeRankDet] emphasize feature learning but struggle with limited robustness and interpretability due to target variability and the high complexity of deep-layer propagation. (b) RPCANet [RPCANet] introduces interpretability via deep unfolding, yet imposes inappropriate constraints directly on the image domain instead of on patch-images, leading to suboptimal performance and computational inefficiency. (c) In contrast, our LRRNet adopts a patch-free compress–reconstruct–subtract paradigm: ① Small-target cues are retained in shallow layers, reducing both model depth and computational cost; ② Low-rank priors enable robust modeling of cluttered backgrounds; ③ High-resolution subtraction isolates small targets effectively without the information loss associated with up/downsampling.
  • Figure 2: Important priors in infrared small target detection. The first row displays original infrared images from diverse scenes, and the second row shows the singular value distributions of their corresponding patch-image matrices [gao2013infrared]. Three key observations can be made. (1) The targets exhibit significant variation in shape and size, and often appear with low contrast against cluttered backgrounds [CSRNet]. This makes it difficult to robustly learn discriminative target-specific features [LCRNet]. (2) Despite differences in clutter types, such as clouds, terrain, sea surface, sky, or man-made structures, the singular value curves demonstrate a consistent trend as the number of patches increases. This indicates that background regions share similar structural properties, which are more suitable for generalized modeling. (3) In highly complex scenes, the linear low-rank assumption becomes insufficient to represent the intrinsic structure of patch-images. This highlights the need for nonlinear low-rank constraints to more accurately capture the characteristics of real-world infrared backgrounds [Wright-Ma-2022].
  • Figure 3: Overview of the proposed LRRNet architecture, which follows the compression–reconstruction–subtraction paradigm. The network learns structure-aware low-rank background representations through a densely connected encoder–decoder backbone, enabling effective small target extraction by subtracting the reconstructed background from the input features. DB and UB denote downsampling and upsampling blocks at each stage, respectively, each consisting of a residual block followed by a downsampling or upsampling convolution. SA represents a self-attention module used to enhance feature representation. The entire network is jointly optimized by a segmentation loss and a reconstruction loss.
  • Figure 4: ROC curves of our LRRNet and other approaches on IRSTD-1K [ISNet].
  • Figure 5: Qualitative evaluation of our LRRNet. (a) Visual comparison of detection results from different methods on representative images from the IRSTD-1K dataset. The red, yellow, and cyan boxes denote correct detections, false alarms, and missed detections, respectively. (b) Visualization of the LRRNet results on the NoisySIRST dataset under Gaussian white noise interference with a variance of 30.
  • ...and 1 more figures
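
The patch-image construction referred to in the Figure 2 caption can be made concrete with a small sketch. Everything here is an illustrative assumption: the helper names (`patch_image`, `exact_rank`), the 8x8 synthetic gradient background, and the patch size and stride are hypothetical, and exact matrix rank is used in place of the singular value curves the figure plots. The point is only that a smooth background yields a patch-image of very low rank, while a small target perturbs that structure in a few patches.

```python
from fractions import Fraction


def patch_image(img, patch, stride):
    """Slide a patch x patch window over img and vectorize each patch.

    Each returned vector corresponds to one column of the patch-image matrix.
    """
    rows, cols = len(img), len(img[0])
    return [
        [img[r + a][c + b] for a in range(patch) for b in range(patch)]
        for r in range(0, rows - patch + 1, stride)
        for c in range(0, cols - patch + 1, stride)
    ]


def exact_rank(vectors):
    """Rank via Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(x) for x in v] for v in vectors]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            if m[i][col] != 0:
                f = m[i][col] / m[rank][col]
                m[i] = [x - f * y for x, y in zip(m[i], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Synthetic smooth background (assumption): a horizontal intensity ramp.
background = [[j for j in range(8)] for _ in range(8)]

# Same scene with one bright single-pixel "target" added.
with_target = [row[:] for row in background]
with_target[3][3] = 50

rank_bg = exact_rank(patch_image(background, patch=4, stride=2))
rank_tg = exact_rank(patch_image(with_target, patch=4, stride=2))
print("background rank:", rank_bg, "| with target:", rank_tg)
```

Real infrared backgrounds are only approximately low-rank, which is why the caption speaks of singular value decay rather than exact rank, and why observation (3) argues that a linear low-rank model becomes insufficient in highly complex scenes.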