Dual-Granularity Semantic Prompting for Language Guidance Infrared Small Target Detection

Zixuan Wang, Haoran Sun, Jiaming Lu, Wenxuan Wang, Zhongling Huang, Dingwen Zhang, Xuelin Qian, Junwei Han

TL;DR

Infrared small target detection (IRSTD) suffers from low signal-to-noise ratio and background clutter, and prior language-guided approaches are limited by annotation overhead and semantic mismatch. The authors propose DGSPNet, an end-to-end framework that integrates dual-granularity semantic prompts (coarse textual priors and fine-grained image-derived tokens produced by an inversion net) with a hierarchical image encoder and text-guided attention. Text-Guide Channel Attention (TGCA) and Text-Guide Spatial Attention (TGSA) mechanisms fuse semantic cues into both the encoder and decoder to focus on potential targets. DGSPNet achieves state-of-the-art results on three IRSTD benchmarks, demonstrating the practical value of language-guided, low-SNR detection and suggesting a broader role for cross-modal prompting in infrared imaging.

Abstract

Infrared small target detection remains challenging due to limited feature representation and severe background interference, resulting in sub-optimal performance. While recent CLIP-inspired methods attempt to leverage textual guidance for detection, they are hindered by inaccurate text descriptions and reliance on manual annotations. To overcome these limitations, we propose DGSPNet, an end-to-end language prompt-driven framework. Our approach integrates dual-granularity semantic prompts: coarse-grained textual priors (e.g., 'infrared image', 'small target') and fine-grained personalized semantic descriptions derived through visual-to-textual mapping within the image space. This design not only facilitates learning fine-grained semantic information but can also leverage language prompts during inference without requiring any annotations. By fully leveraging the precision and conciseness of text descriptions, we further introduce a text-guide channel attention (TGCA) mechanism and a text-guide spatial attention (TGSA) mechanism that enhance the model's sensitivity to potential targets across both low- and high-level feature spaces. Extensive experiments demonstrate that our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.

Paper Structure

This paper contains 15 sections, 11 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of DGSPNet. The entire training process is divided into two phases: pre-training and normal training. During the pre-training phase, the network’s decoder is replaced with a reconstruction decoder, and contrastive loss is introduced to supervise the training of the inversion network. During the normal training phase, the weights of the inversion net are frozen. Each convolutional block consists of one convolution unit, a normalization layer, and a ReLU activation layer, and the deconvolutional block has the same composition.
  • Figure 2: Illustration of TGCA and TGSA. Guided by text features, these modules generate channel and spatial attention weights, which are applied via multiplication and residual connections to enhance visual features.
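
The Figure 2 caption says that text features produce channel and spatial attention weights, which are applied by multiplication with a residual connection. The sketch below is one possible reading of that description in plain Python, not the authors' implementation. The function names `project`, `tgca`, and `tgsa`, the linear projection of the text embedding, the average pooling, the sigmoid gating, and the scaling by the square root of the channel count are all assumptions made for illustration; the source does not specify them.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def project(weight, text):
    """Hypothetical linear map: weight is a C x D matrix, text has length D."""
    return [sum(w * t for w, t in zip(row, text)) for row in weight]


def tgca(feat, text, weight):
    """Text-guided channel attention (assumed form).

    feat is a nested list indexed as [channel][row][col].
    One gate per channel is computed from the pooled visual response
    and the projected text embedding, then applied as feat * gate + feat.
    """
    text_c = project(weight, text)
    out = []
    for c, plane in enumerate(feat):
        count = sum(len(row) for row in plane)
        pooled = sum(sum(row) for row in plane) / count  # global average pool
        gate = sigmoid(pooled * text_c[c])
        out.append([[v * gate + v for v in row] for row in plane])
    return out


def tgsa(feat, text, weight):
    """Text-guided spatial attention (assumed form).

    One gate per pixel is computed from the similarity between that
    pixel's channel vector and the projected text embedding, then
    applied as feat * gate + feat.
    """
    text_c = project(weight, text)
    n_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    scale = math.sqrt(n_ch)
    gates = [
        [
            sigmoid(sum(feat[c][h][w] * text_c[c] for c in range(n_ch)) / scale)
            for w in range(width)
        ]
        for h in range(height)
    ]
    return [
        [
            [feat[c][h][w] * gates[h][w] + feat[c][h][w] for w in range(width)]
            for h in range(height)
        ]
        for c in range(n_ch)
    ]


# Toy inputs: 2 channels, 2 x 2 spatial grid, 3-dimensional text embedding.
feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
text = [0.2, -0.1, 0.4]
weight = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # hypothetical C x D projection

channel_out = tgca(feat, text, weight)
spatial_out = tgsa(feat, text, weight)
```

Because each gate lies strictly between 0 and 1, the residual form keeps every output value between the input value and twice the input value, so the attention can emphasise features but never suppresses them below their original magnitude. A real model would learn the projection weights and operate on tensors rather than nested lists.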