Learning Dynamic Local Context Representations for Infrared Small Target Detection

Guoyi Zhang, Guangsheng Xu, Han Wang, Siyang Chen, Yunxiao Shan, Xiaohu Zhang

TL;DR

Infrared small target detection (ISTD) is challenged by clutter, low signal-to-clutter ratios, and scale variation. The authors propose LCRNet, a lightweight U-Net–like model that learns dynamic local context representations through three components: Coarse-to-fine Convolution Block (C2FBlock), Dynamic Local Context Attention (DLC-Attention), and HLKConv for efficient large-kernel processing. Through a multigrid-inspired refinement, adaptive receptive-field allocation, and hierarchical large-kernel decomposition, LCRNet achieves state-of-the-art results on IRSTD-1K, SIRSTAUG, and NUDT-SIRST with only 1.65M parameters and low computational cost. Ablation studies validate the contributions of each component, and results indicate robust performance and practical efficiency, suggesting strong potential for real-time ISTD with further optimizations.

Abstract

Infrared small target detection (ISTD) is challenging due to complex backgrounds, low signal-to-clutter ratios, and varying target sizes and shapes. Effective detection relies on capturing local contextual information at the appropriate scale. However, small-kernel CNNs have limited receptive fields, leading to false alarms, while transformer models, with global receptive fields, often treat small targets as noise, resulting in missed detections. Hybrid models struggle to bridge the semantic gap between CNNs and transformers, causing high complexity. To address these challenges, we propose LCRNet, a novel method that learns dynamic local context representations for ISTD. The model consists of three components: (1) C2FBlock, inspired by PDE solvers, for efficient small target information capture; (2) DLC-Attention, a large-kernel attention mechanism that dynamically builds context and reduces feature redundancy; and (3) HLKConv, a hierarchical convolution operator based on large-kernel decomposition that preserves sparsity and mitigates the drawbacks of dilated convolutions. Despite its simplicity, with only 1.65M parameters, LCRNet achieves state-of-the-art (SOTA) performance. Experiments on multiple datasets, comparing LCRNet with 33 SOTA methods, demonstrate its superior performance and efficiency.
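The receptive-field argument in the abstract can be made concrete with a short calculation. The sketch below is not taken from the paper: it applies the standard recurrence for the theoretical receptive field of stacked convolutions, and the layer configurations (three 3×3 layers, one 13×13 layer) are illustrative assumptions rather than LCRNet's actual layers. The helper name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """Theoretical receptive field (in pixels, per axis) of a conv stack.

    `layers` is a sequence of (kernel_size, stride, dilation) tuples,
    listed from the input side to the output side.
    """
    rf = 1    # a single output pixel sees one input pixel before any layer
    jump = 1  # distance in input pixels between adjacent feature-map pixels
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


# Illustrative (assumed) configurations, not LCRNet's real layers.
three_small = [(3, 1, 1)] * 3   # three stacked 3x3 convolutions
one_large = [(13, 1, 1)]        # a single 13x13 convolution

print(receptive_field(three_small))  # 7
print(receptive_field(one_large))    # 13
```

These are upper bounds: the effective receptive field (ERF) referred to in the Figure 2 caption is typically smaller than the theoretical value computed here.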

Paper Structure

This paper contains 25 sections, 17 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Comparison of the proposed LCRNet with other data-driven methods on the IRSTD-1K dataset [ISNet]. The area of the colored circles represents the number of FLOPs, while the diamonds indicate cases where FLOPs are unknown. Our LCRNet achieves a remarkable balance between computational efficiency and detection performance, setting a new SOTA.
  • Figure 2: Two key priors for infrared small target detection. (a) Small target information is confined to local regions of the image, and global observation often leads to missed detection of small targets. (b) Small target detection requires contextual information, where targets with varying shapes, sizes, and SCRs need different local contextual scales for accurate detection. As the context size increases, pixel-level details of the target are lost. Here, ERF denotes the size of the effective receptive field [luo2016understanding].
  • Figure 3: The overall architecture of LCRNet follows the typical U-Net structure [ho2020denoising, nichol2021improved, lugmayr2022repaint], with stacked C2FBlocks. InitConv and OutConv are two 3×3 convolutions: InitConv increases the number of channels from 1 to $C_1$, while OutConv reduces the channels from $C_1$ back to 1. Specifically, $L^{*}_i = L_i + 1$, with $L_1 = L_2 = L_3 = L_4 = 3$ and channel sizes $C_1 = 16$, $C_2 = 32$, $C_3 = 64$, and $C_4 = 64$. The total number of parameters is 1.65M, and the FLOPs total 59.3G.
  • Figure 4: The overall architecture of the proposed Coarse-to-fine Convolution Block (C2FBlock) simulates the iterative process of a multigrid method [10061442]. C2FBlock effectively distinguishes between spatial and frequency-domain distributions of similar targets, background clutter, and noise.
  • Figure 5: Overall architecture of the proposed attention. For simplicity, we show the proposed attention in cardinality-major view (the feature-map groups with the same cardinal group index reside next to each other). We use the radix-major layout [zhang2022resnest] in the real implementation, which can be modularized and accelerated by group convolution and standard CNN layers.
  • ...and 5 more figures
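
The stage configuration quoted in the Figure 3 caption can be written down as a small sketch. This is an illustration under stated assumptions, not the authors' code: the names `StageConfig`, `STAGES`, and `conv_params` are hypothetical, reading $L_i$ as a per-stage depth is an interpretation of the caption, and the convolutions are assumed to carry a bias term. Only the InitConv and OutConv parameter counts are computed, since the internals of C2FBlock are not described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """One stage of the U-Net-like layout (hypothetical container)."""
    channels: int  # C_i in the caption
    depth: int     # L_i in the caption (assumed to be a per-stage depth)

    @property
    def depth_star(self) -> int:
        # The caption states L*_i = L_i + 1.
        return self.depth + 1


# Values taken from the Figure 3 caption: L_1..L_4 = 3, C = 16, 32, 64, 64.
STAGES = [
    StageConfig(channels=16, depth=3),
    StageConfig(channels=32, depth=3),
    StageConfig(channels=64, depth=3),
    StageConfig(channels=64, depth=3),
]


def conv_params(c_in: int, c_out: int, kernel: int = 3, bias: bool = True) -> int:
    """Parameter count of a standard 2D convolution (bias is an assumption)."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


# InitConv maps 1 -> C_1 channels and OutConv maps C_1 -> 1, both 3x3.
init_conv_params = conv_params(1, STAGES[0].channels)
out_conv_params = conv_params(STAGES[0].channels, 1)

print(init_conv_params)  # 160
print(out_conv_params)   # 145
print([s.depth_star for s in STAGES])  # [4, 4, 4, 4]
```

Under these assumptions the two boundary convolutions account for only a few hundred of the reported 1.65M parameters, so nearly all of the parameter budget sits in the stacked C2FBlocks.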