Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection
Zhihua Shen, Siyang Chen, Han Wang, Tongsu Zhang, Xiaohu Zhang, Xiangpeng Xu, Xia Yang
TL;DR
This work tackles multi-frame infrared small target detection (IRSTD) by proposing LVNet, a simple yet effective CNN-Transformer hybrid. It introduces a multi-scale CNN frontend to enhance low-level local features, coupled with a U-shaped video Transformer backbone to model spatiotemporal context. LVNet outperforms existing state-of-the-art methods on IRDST and NUDT-MIRSDT; compared with LMAFormer, it improves nIoU by 5.63% / 18.36% on the two datasets while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost. Ablation studies confirm the importance of low-level representation learning and the symmetric Transformer decoder, underscoring the practicality of low-level-centric hybrids for moving small-target detection. Code and trained models are publicly available.
Abstract
Multi-frame infrared small target detection (IRSTD) plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD performance. In this paper, we propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks for multi-frame IRSTD. Our key insight is that the standard linear patch embeddings in Vision Transformers are insufficient for capturing the scale-sensitive local features critical to infrared small targets. To address this limitation, we introduce a multi-scale CNN frontend that explicitly models local features by leveraging the local spatial bias of convolution. Additionally, we design a U-shaped video Transformer for multi-frame spatiotemporal context modeling, effectively capturing the motion characteristics of targets. Experiments on the publicly available datasets IRDST and NUDT-MIRSDT demonstrate that LVNet outperforms existing state-of-the-art methods. Notably, compared to the current best-performing method, LMAFormer, LVNet achieves an improvement of 5.63% / 18.36% in nIoU on the two datasets, respectively, while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost. Ablation studies further validate the importance of low-level representation learning in hybrid architectures. Our code and trained models are available at https://github.com/ZhihuaShen/LVNet.
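
The abstract reports its headline gain in nIoU without defining the metric. The following is a minimal sketch, assuming the definitions commonly used in the IRSTD literature: IoU accumulates intersection and union pixels over the whole test set, while nIoU averages the per-image IoU, so that an image with a tiny target weighs as much as one with a larger target. The function names and the treatment of empty images are illustrative assumptions and are not taken from the LVNet repository.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def iou_and_niou(preds: List[Mask], targets: List[Mask]) -> Tuple[float, float]:
    """Return (IoU, nIoU) for a list of predicted and ground-truth masks."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equally long")

    total_inter = 0
    total_union = 0
    per_image_iou = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
        # Assumption: an image with no predicted and no true target pixels
        # counts as a perfect match.
        per_image_iou.append(inter / union if union else 1.0)

    # IoU: pixels accumulated over the whole set before dividing.
    iou = total_inter / total_union if total_union else 1.0
    # nIoU: mean of the per-image IoU values.
    niou = sum(per_image_iou) / len(per_image_iou)
    return iou, niou
```

For example, with one image whose prediction covers the single target pixel plus one false-alarm pixel (per-image IoU 1/2) and a second image predicted exactly (per-image IoU 1), the accumulated IoU is 2/3 while nIoU is 0.75. The two metrics can therefore move by different amounts for the same model.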
