Unleashing the Power of Generic Segmentation Models: A Simple Baseline for Infrared Small Target Detection

Mingjin Zhang, Chi Zhang, Qiming Zhang, Yunsong Li, Xinbo Gao, Jing Zhang

TL;DR

This paper bridges infrared small target detection (IRSTD) and generic segmentation models. It investigates the Segment Anything Model (SAM) and its derivatives, finding that they can match state-of-the-art IRSTD performance but are prone to overfitting when fine-tuned on IRSTD data. The authors propose a simple distilled baseline that transfers knowledge from Semantic-SAM to a lightweight RepViT-based encoder, augmented with a tiny FPN and a novel dense-sparse query design for multi-scale feature fusion. With distillation and cross-level interactions, the method achieves state-of-the-art results on four IRSTD datasets, surpassing SAM and Semantic-SAM by over 14 IoU on NUDT and 4 IoU on IRSTD1k while maintaining high throughput. The work demonstrates the practical viability of building IRSTD models on generic segmentation foundations, and the authors state that code and models will be released.

Abstract

Recent progress in deep learning has greatly advanced the field of infrared small target detection (IRSTD). Despite their remarkable success, a notable gap persists between these IRSTD methods and generic segmentation approaches in natural image domains. This gap primarily arises from the significant modality differences and the limited availability of infrared data. In this study, we aim to bridge this gap by investigating the adaptation of generic segmentation models, such as the Segment Anything Model (SAM), to IRSTD tasks. Our investigation reveals that many generic segmentation models can achieve performance comparable to state-of-the-art IRSTD methods. However, their full potential in IRSTD remains untapped. To address this, we propose a simple, lightweight, yet effective baseline model for segmenting small infrared objects. Through appropriate distillation strategies, we empower smaller student models to outperform state-of-the-art methods, even surpassing fine-tuned teacher results. Furthermore, we enhance the model's performance by introducing a novel query design comprising dense and sparse queries to effectively encode multi-scale features. Through extensive experimentation across four popular IRSTD datasets, our model demonstrates significantly improved performance in both accuracy and throughput compared to existing approaches, surpassing SAM and Semantic-SAM by over 14 IoU on NUDT and 4 IoU on IRSTD1k. The source code and models will be released at https://github.com/O937-blip/SimIR.
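
The IoU gains quoted above refer to intersection over union between a predicted mask and the ground-truth mask. The sketch below is a minimal, illustrative computation in plain Python, not the authors' evaluation code. The function name `iou`, the toy masks, and the convention that two empty masks score 1.0 are assumptions made for this example.

```python
def iou(pred, target):
    """Intersection over Union for two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0  # pixel set in both masks
            union += 1 if (p or t) else 0   # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 4x4 example: the target covers 4 pixels, the prediction misses one.
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
target = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
score = iou(pred, target)
print(f"{score:.2f}")  # 0.75
```

In this toy case a single missed pixel on a four-pixel target lowers IoU from 1.0 to 0.75, which illustrates why the metric is sensitive for very small targets.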

Paper Structure (26 sections, 3 equations, 5 figures, 2 tables, 1 algorithm)


Figures (5)

  • Figure 1: The pipeline of our model. First, the pre-trained image encoder takes infrared images as input and generates latent feature maps at four scales. These feature maps are passed through an FPN for bottom-up information aggregation. The decoder takes the output of FPN and makes mask predictions. Further, we incorporate a novel query design in our model for better cross-level information propagation.
  • Figure 2: The proposed distillation framework. The modules in blue are frozen during the distillation process, while the modules in red are trainable.
  • Figure 3: Details of decoding. Our model employs a multi-stage approach for mask predictions. First, after the image encoder, the sparse encoder queries $\textbf{Q}_{encoder}$, updated through a two-layer MLP, interact with the dense queries $\textbf{Q}_{dense}$, updated via a convolutional layer, to generate early predictions. Subsequently, following the FPN, the processed queries $\textbf{Q}_{FPN}$ are combined with the FPN output to produce intermediate predictions. In the final stage, $\textbf{Q}_{encoder}$, $\textbf{Q}_{FPN}$ and $\textbf{Q}_{decoder}$ are incorporated into the modified SAM decoder. After interacting with image features through a two-way transformer, $\textbf{Q}_{encoder}$ and $\textbf{Q}_{FPN}$ are discarded, and the decoder makes mask predictions with a spatially point-wise product between the mask features and $\textbf{Q}_{decoder}$, which is updated by an MLP.
  • Figure 4: The ablation study on the proposed query design. (a) Heatmaps of the P3 and P2 stages before the learned queries are applied. (b) Heatmaps of the P3 and P2 stages after the learned queries are applied. (c) Locations of P3 and P2 in the FPN.
  • Figure 5: Comparison of IoU and throughput on a single NVIDIA GeForce RTX 4090 GPU. The circle size indicates the model size. The batch size is set to 1, and the experiments are conducted on the IRSTD1k dataset.
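
The final step in the Figure 3 caption, a spatially point-wise product between mask features and the decoder query, can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the authors' implementation: the query is taken to be a length-C vector, the mask features a C x H x W array, and the binary mask is obtained by thresholding the logits at zero. The function name `predict_mask` and the toy values are hypothetical, and the MLP that updates the query is omitted.

```python
def predict_mask(query, features):
    """Dot the query with the feature vector at every pixel.

    query:    list of C floats (stands in for the MLP-updated decoder query)
    features: C x H x W nested lists (stands in for the mask features)
    returns:  H x W nested list of mask logits
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    if len(query) != channels:
        raise ValueError("query length must equal the number of feature channels")
    return [
        [sum(query[c] * features[c][y][x] for c in range(channels))
         for x in range(width)]
        for y in range(height)
    ]


# Toy input: 2 channels, 2x2 spatial grid.
features = [
    [[1.0, 0.0], [0.0, 2.0]],  # channel 0
    [[0.0, 1.0], [1.0, 0.0]],  # channel 1
]
query = [1.0, -1.0]

logits = predict_mask(query, features)
# Assumed post-processing: a pixel is foreground when its logit is positive.
mask = [[1 if value > 0 else 0 for value in row] for row in logits]
print(logits)  # [[1.0, -1.0], [-1.0, 2.0]]
print(mask)    # [[1, 0], [0, 1]]
```

Because each pixel is scored independently against the same query vector, the query acts as a learned per-pixel classifier over the mask features.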