Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes

Feng Huang, Shuyuan Zheng, Zhaobing Qiu, Huanxian Liu, Huanxin Bai, Liqiong Chen

TL;DR

The paper tackles infrared small target detection (IRSTD) in cluttered, complex scenes where targets provide minimal visual cues. It introduces Text-IRSTD, a text-guided IRSTD framework that leverages fuzzy semantic prompts and a progressive cross-modal semantic interaction decoder (PCSID) to fuse text and image features, implemented via TGFA and TGSI blocks. A new FZDT dataset with 2,755 infrared images and fuzzy textual annotations is constructed to evaluate cross-modal performance, and experiments demonstrate state-of-the-art IoU, Pd, and contour recovery, including strong generalization to unseen scenarios. These findings show that incorporating semantic text significantly enhances IRSTD robustness and practicality, with code and dataset to be released post-acceptance.

Abstract

Infrared small target detection is an active and challenging task in computer vision. Existing methods usually focus on mining visual features of targets and therefore struggle to cope with complex and diverse detection scenarios. The main reason is that infrared small targets carry limited image information on their own, so relying only on visual features fails to discriminate targets from interference, which lowers detection performance. To address this issue, we introduce a novel approach that leverages semantic text to guide infrared small target detection, called Text-IRSTD. It extends classical IRSTD to text-guided IRSTD, opening a new research direction. On the one hand, we devise a novel fuzzy semantic text prompt to accommodate ambiguous target categories. On the other hand, we propose a progressive cross-modal semantic interaction decoder (PCSID) to facilitate information fusion between texts and images. In addition, we construct a new benchmark, called FZDT, consisting of 2,755 infrared images of different scenarios with fuzzy semantic textual annotations. Extensive experimental results demonstrate that our method achieves better detection performance and target contour recovery than state-of-the-art methods. Moreover, the proposed Text-IRSTD shows strong generalization and wide application prospects in unseen detection scenarios. The dataset and code will be publicly released after acceptance of this paper.
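
The detection performance referred to above is usually reported in IRSTD work as pixel-level IoU and target-level probability of detection (Pd), the two metrics named in the TL;DR. The sketch below shows one common way to compute them on binary masks. It is an illustration only and not the authors' evaluation code: the function names, the 8-connectivity rule, and the 3-pixel centroid threshold (`max_dist`) are assumptions, since this summary does not state the paper's exact definitions.

```python
from collections import deque
from math import hypot


def iou(pred, gt):
    """Pixel-level IoU between two binary masks given as lists of rows of 0/1."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly; treating that case as 1.0 is an assumption.
    return inter / union if union else 1.0


def components(mask):
    """Return the 8-connected components of a binary mask as sets of (row, col)."""
    height = len(mask)
    width = len(mask[0]) if mask else 0
    seen, comps = set(), []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or (r, c) in seen:
                continue
            comp, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                comp.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            comps.append(comp)
    return comps


def centroid(comp):
    """Mean (row, col) position of a component."""
    n = len(comp)
    return sum(y for y, _ in comp) / n, sum(x for _, x in comp) / n


def probability_of_detection(pred, gt, max_dist=3.0):
    """Fraction of ground-truth targets matched by a predicted component.

    A target counts as detected when an unused predicted component has its
    centroid within max_dist pixels of the target centroid (assumed rule).
    """
    targets = [centroid(t) for t in components(gt)]
    candidates = [centroid(p) for p in components(pred)]
    if not targets:
        return 0.0  # Pd is undefined without targets; 0.0 is a placeholder choice
    used, detected = set(), 0
    for ty, tx in targets:
        for i, (py, px) in enumerate(candidates):
            if i not in used and hypot(ty - py, tx - px) <= max_dist:
                used.add(i)
                detected += 1
                break
    return detected / len(targets)


# Toy 6x6 example: one target partly recovered, one single-pixel target missed.
pred = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
gt = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print("IoU:", iou(pred, gt))
print("Pd:", probability_of_detection(pred, gt))
```

In this toy case the prediction overlaps 3 of the 5 pixels in the union and finds one of the two ground-truth targets. IoU is sensitive to how well the target contour is recovered, while Pd only asks whether each target was found at all, which is why the two are typically reported together.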

Paper Structure

This paper contains 17 sections, 12 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Visual comparison between the SOTA model and the proposed Text-IRSTD. (a) Detection results of the SOTA model SCTransNet (Yuan et al., 2024), which relies only on visual features. (b) Detection results of the proposed Text-IRSTD, which uses both textual and visual features.
  • Figure 2: (a) Typical semantic text prompt for generic target detection. (b) Proposed fuzzy semantic text prompt for IRSTD.
  • Figure 3: Overview of the proposed Text-IRSTD. It consists of three main components: a text encoder, an image encoder, and the cross-modal decoder PCSID, with their outputs denoted $E_{text}$, $E_{img}^{(i)}$, and $D_{cm}$, respectively. DPM denotes the detail perception module, TGFA block denotes the text-guided feature aggregation block, and TGSI block denotes the text-guided semantic interaction block.
  • Figure 4: Structure of the TGFA block, where DPM denotes the detail perception module and CA denotes channel attention.
  • Figure 5: Visual results of different IRSTD methods. An enlarged view of each target is shown in the image corner. The red, blue, and yellow boxes represent correctly detected targets, missed targets, and false detections, respectively.
  • ...and 1 more figure