Neural Spatial-Temporal Tensor Representation for Infrared Small Target Detection
Fengyi Wu, Simin Liu, Haoan Wang, Bingjie Tao, Junhai Luo, Zhenming Peng
TL;DR
This work tackles infrared small target detection by linking optimization-based background modeling with neural implicit representations in a unified unsupervised framework. It introduces NeurSTT, which represents the background as a neural-revealed low-rank tensor $\boldsymbol{\text{B}}^{\text{NLR}} = [\boldsymbol{\text{G}}; f_{\theta_h}, f_{\theta_w}, f_{\theta_t}]$, and enforces temporal-spatial regularity via a novel neural 3-D total variation (Neur3DTV). The method jointly estimates the background and targets through an alternating minimization that combines nuclear-norm, sparsity, and Neur3DTV losses, with targets recovered by a soft-thresholding step and all parameters updated by Adam. Empirical results on multiple ISTD datasets show that NeurSTT achieves higher IoU and F1 scores with significantly fewer parameters (up to 16.6× fewer) and stronger background suppression, demonstrating robust, unsupervised target detection in challenging scenes. The approach highlights the value of domain-informed neural representations for tensor-based background modeling in video ISTD, with potential applicability to other low-rank + sparse detection tasks.
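As a rough illustration of the representation and the target update described above, the following NumPy sketch builds a Tucker-style background from a core tensor and three coordinate-driven factor functions, then applies soft-thresholding to the residual. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-hidden-layer sine MLP, the function names (`mlp_factor`, `init_mlp`, `tucker_background`, `soft_threshold`), the ranks, the random (untrained) weights, and the threshold value.

```python
import numpy as np

def init_mlp(rng, hidden_dim, rank):
    """Random weights for a tiny one-hidden-layer MLP (hypothetical architecture)."""
    w1 = rng.normal(size=(1, hidden_dim))
    b1 = rng.normal(size=(hidden_dim,))
    w2 = rng.normal(size=(hidden_dim, rank)) / np.sqrt(hidden_dim)
    return w1, b1, w2

def mlp_factor(coords, w1, b1, w2):
    """Map n scalar coordinates to an (n, rank) factor matrix."""
    hidden = np.sin(coords[:, None] * w1 + b1)  # (n, hidden_dim)
    return hidden @ w2                          # (n, rank)

def tucker_background(core, u_h, u_w, u_t):
    """Mode products of the core with the three factors, written as one einsum."""
    return np.einsum("abc,ha,wb,tc->hwt", core, u_h, u_w, u_t)

def soft_threshold(x, tau):
    """Elementwise shrinkage: the proximal operator of tau * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

rng = np.random.default_rng(0)
H, W, T = 16, 16, 8          # height, width, number of frames (assumed sizes)
r_h, r_w, r_t = 3, 3, 2      # assumed Tucker ranks

core = rng.normal(size=(r_h, r_w, r_t))
u_h = mlp_factor(np.linspace(0.0, 1.0, H), *init_mlp(rng, 8, r_h))
u_w = mlp_factor(np.linspace(0.0, 1.0, W), *init_mlp(rng, 8, r_w))
u_t = mlp_factor(np.linspace(0.0, 1.0, T), *init_mlp(rng, 8, r_t))

background = tucker_background(core, u_h, u_w, u_t)  # (H, W, T), low rank

# Synthetic sequence: the background plus one bright pixel acting as a target.
video = background.copy()
video[5, 7, 3] += 10.0

# Target estimate: shrink the residual so only large deviations survive.
target = soft_threshold(video - background, tau=1.0)
```

In a full pipeline the core and the MLP weights would be trained (the summary above mentions Adam) rather than left at random values; the sketch only shows how the pieces fit together.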
Abstract
Optimization-based approaches dominate infrared small target detection because they leverage the intrinsic low-rankness and sparsity of infrared imagery. While effective for single-frame images, they struggle with dynamic changes in multi-frame scenarios, as traditional spatial-temporal representations often fail to adapt. To address these challenges, we introduce a Neural-represented Spatial-Temporal Tensor (NeurSTT) model. This framework employs nonlinear networks to enhance spatial-temporal feature correlations in background approximation, thereby supporting target detection in an unsupervised manner. Specifically, we employ neural layers to approximate sequential backgrounds within a low-rank-informed deep scheme. A neural three-dimensional total variation is developed to refine background smoothness while reducing static target-like clusters in sequences. Traditional sparsity constraints are incorporated into the loss functions to preserve potential targets. By replacing complex solvers with a deep updating strategy, NeurSTT simplifies the optimization process in a domain-aware way. Visual and numerical results across various datasets demonstrate that our method outperforms competing methods in challenging detection scenes. Notably, it has 16.6$\times$ fewer parameters and achieves an average $IoU$ 19.19\% higher than the second-best method on $256 \times 256$ sequences.
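To make the smoothness and sparsity terms mentioned in the abstract more concrete, here is a minimal NumPy sketch of a classical anisotropic 3-D total variation and an $\ell_1$ sparsity penalty. This is a plain finite-difference stand-in, not the paper's Neur3DTV; the function names `tv3d` and `sparsity` and the per-axis weights are assumptions introduced for illustration.

```python
import numpy as np

def tv3d(x, weights=(1.0, 1.0, 1.0)):
    """Weighted L1 norm of forward differences along height, width, and time."""
    return sum(w * np.abs(np.diff(x, axis=ax)).sum()
               for ax, w in enumerate(weights))

def sparsity(t):
    """L1 norm, the usual convex surrogate for the number of nonzeros."""
    return np.abs(t).sum()

# A constant volume has zero total variation.
flat = np.ones((4, 4, 3))

# A single step edge along the width axis: 4 rows x 3 frames of unit jumps.
step = flat.copy()
step[:, 2:, :] = 2.0

# A target tensor with one active pixel.
target = np.zeros((4, 4, 3))
target[1, 1, 0] = 5.0

tv_flat = tv3d(flat)
tv_step = tv3d(step)
l1_target = sparsity(target)
```

Penalizing differences along the time axis as well as the two spatial axes is what discourages background estimates that flicker between frames, while the $\ell_1$ term keeps the target component restricted to a few pixels.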
