Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection

Zhihua Shen, Siyang Chen, Han Wang, Tongsu Zhang, Xiaohu Zhang, Xiangpeng Xu, Xia Yang

TL;DR

This work tackles multi-frame infrared small target detection (IRSTD) by proposing LVNet, a simple yet effective CNN-Transformer hybrid. It introduces a multi-scale CNN frontend to enhance low-level local features, coupled with a U-shaped video Transformer backbone to model spatiotemporal context. LVNet achieves state-of-the-art pixel-level metrics on IRDST and NUDT-MIRSDT while using a fraction of the parameters and computational cost of prior methods, notably improving IoU by 6.76 percentage points and nIoU by 18.40 percentage points over LMAFormer. Ablation studies confirm the importance of low-level representation learning and the symmetric Transformer decoder, underscoring the practicality of low-level-centric hybrids for moving small-target detection; code and models are publicly available.
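
The two pixel-level metrics quoted above can be illustrated with a short sketch. It assumes the convention commonly used in the IRSTD literature: IoU pools intersection and union over all images, while nIoU averages the per-image IoU. The function name `iou_and_niou` and the choice to skip images whose union is empty are assumptions made for this illustration; the paper's exact evaluation protocol is not given in this summary.

```python
def iou_and_niou(preds, targets):
    """Compute dataset-level IoU and per-image-averaged nIoU.

    preds, targets: lists of binary masks, each mask a list of rows of 0/1.
    """
    total_inter = 0
    total_union = 0
    per_image = []
    for pred, gt in zip(preds, targets):
        inter = 0
        union = 0
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                inter += 1 if (p and g) else 0
                union += 1 if (p or g) else 0
        total_inter += inter
        total_union += union
        # Assumption: images with no predicted and no true pixels are skipped.
        if union > 0:
            per_image.append(inter / union)
    iou = total_inter / total_union if total_union else 0.0
    niou = sum(per_image) / len(per_image) if per_image else 0.0
    return iou, niou


# Image A: a 4-pixel target detected perfectly. Image B: a 1-pixel target missed by one pixel.
preds = [[[1, 1], [1, 1]], [[1, 0], [0, 0]]]
targets = [[[1, 1], [1, 1]], [[0, 1], [0, 0]]]
iou, niou = iou_and_niou(preds, targets)
print(round(iou, 3), round(niou, 3))  # 0.667 0.5
```

In this toy case the missed one-pixel target barely moves the pooled IoU but halves nIoU, which suggests why nIoU gaps can be much larger than IoU gaps when targets are very small.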

Abstract

Multi-frame infrared small target detection (IRSTD) plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD performance. In this paper, we propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks for multi-frame IRSTD. Our key insight is that the standard linear patch embeddings in Vision Transformers are insufficient for capturing the scale-sensitive local features critical to infrared small targets. To address this limitation, we introduce a multi-scale CNN frontend that explicitly models local features by leveraging the local spatial bias of convolution. Additionally, we design a U-shaped video Transformer for multi-frame spatiotemporal context modeling, effectively capturing the motion characteristics of targets. Experiments on the publicly available datasets IRDST and NUDT-MIRSDT demonstrate that LVNet outperforms existing state-of-the-art methods. Notably, compared to the current best-performing method, LMAFormer, LVNet achieves an improvement of 5.63% / 18.36% in nIoU, while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost. Ablation studies further validate the importance of low-level representation learning in hybrid architectures. Our code and trained models are available at https://github.com/ZhihuaShen/LVNet.
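
The contrast between a patch embedding and a multi-scale convolutional frontend can be made concrete with a toy sketch. This is not the LVNet frontend, whose design is not described in this summary. Fixed mean filters stand in for learned weights, and the names `box_filter`, `multi_scale_frontend`, and `patch_embed`, the kernel sizes (1, 3, 5), and the patch size 4 are all assumptions chosen for illustration.

```python
def box_filter(img, k):
    """Same-size k x k mean filter with zero padding (stand-in for a learned conv)."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        s += img[yy][xx]
            out[y][x] = s / (k * k)
    return out


def multi_scale_frontend(img, kernel_sizes=(1, 3, 5)):
    """Return one full-resolution feature map per kernel size."""
    return [box_filter(img, k) for k in kernel_sizes]


def patch_embed(img, patch=4):
    """Split into non-overlapping patches; map each to one value with a fixed linear map (the mean)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y + dy][x + dx] for dy in range(patch) for dx in range(patch)) / patch ** 2
            for x in range(0, w, patch)
        ]
        for y in range(0, h, patch)
    ]


# An 8 x 8 frame containing a single-pixel "target" at row 5, column 2.
img = [[0.0] * 8 for _ in range(8)]
img[5][2] = 1.0

tokens = patch_embed(img)          # 2 x 2 grid: one token per 4 x 4 patch
feats = multi_scale_frontend(img)  # three 8 x 8 maps, one per receptive field
print(tokens[1][0])                # 0.0625
print(feats[0][5][2])              # 1.0
```

In this toy setting the patch token only records that something faint lies somewhere inside a 4 x 4 region, whereas the convolutional maps keep the response at the target's pixel and describe it at several receptive-field sizes. A learned patch projection is more expressive than a mean, so the sketch overstates the gap; it is meant only to show the difference in spatial granularity.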

Paper Structure

This paper contains 25 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison of the proposed LVNet with SOTA methods on the IRDST dataset [sun2023receptive]. The area of each circle represents FLOPs. LVNet achieves a strong balance between computational efficiency and detection performance, setting a new SOTA.
  • Figure 2: The architecture of LVNet.
  • Figure 3: Visual comparison of the detection results of different methods on representative images from the NUDT-MIRSDT dataset. Genuine targets are highlighted and magnified in the lower left corner. Red circles mark correctly detected targets, blue circles indicate missed detections, and yellow circles represent false alarms.
  • Figure 4: ROC curves of different methods on the NUDT-MIRSDT dataset.