SAMamba: Adaptive State Space Modeling with Hierarchical Vision for Infrared Small Target Detection

Wenhao Xu, Shuchen Zheng, Changwei Wang, Zherui Zhang, Chuan Ren, Rongtao Xu, Shibiao Xu

TL;DR

This work tackles infrared small target detection (ISTD), where targets occupy minuscule image areas and blend into cluttered backgrounds. It introduces SAMamba, which combines SAM2's hierarchical features with Vision Mamba-inspired selective sequence modeling, augmented by the FS-Adapter for domain-aware feature selection, the CSI module for efficient long-range context, and the DPCF fusion strategy to preserve fine details during multi-scale fusion. Empirical results on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST show SAMamba achieving state-of-the-art IoU, nIoU, and F1 scores, including strong performance on synthetic, highly challenging scenes. The approach offers a robust, computation-aware solution for ISTD with practical implications for long-range surveillance and autonomous systems, while highlighting avenues for temporal, hardware-optimized, and multi-modal extensions.

Abstract

Infrared small target detection (ISTD) is vital for long-range surveillance in military, maritime, and early warning applications. ISTD is challenged by targets occupying less than 0.15% of the image and low distinguishability from complex backgrounds. Existing deep learning methods often suffer from information loss during downsampling and inefficient global context modeling. This paper presents SAMamba, a novel framework integrating SAM2's hierarchical feature learning with Mamba's selective sequence modeling. Key innovations include: (1) A Feature Selection Adapter (FS-Adapter) for efficient natural-to-infrared domain adaptation via dual-stage selection (token-level with a learnable task embedding and channel-wise adaptive transformations); (2) A Cross-Channel State-Space Interaction (CSI) module for efficient global context modeling with linear complexity using selective state space modeling; and (3) A Detail-Preserving Contextual Fusion (DPCF) module that adaptively combines multi-scale features with a gating mechanism to balance high-resolution and low-resolution feature contributions. SAMamba addresses core ISTD challenges by bridging the domain gap, maintaining fine-grained details, and efficiently modeling long-range dependencies. Experiments on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets show SAMamba significantly outperforms state-of-the-art methods, especially in challenging scenarios with heterogeneous backgrounds and varying target scales. Code: https://github.com/zhengshuchen/SAMamba.
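The IoU, nIoU, and F1 scores cited in the TL;DR can be illustrated with a minimal sketch. It assumes the definitions commonly used in ISTD work: pixel-level counts, IoU accumulated over the whole dataset, and nIoU as the mean of per-image IoU. The paper's exact evaluation protocol is not given in this section, so the helper names, the flat 0/1 mask format, and the convention of scoring an empty image as 0 are all assumptions made for illustration.

```python
def _counts(pred, target):
    """Return (tp, fp, fn) pixel counts for one image given flat 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    return tp, fp, fn


def _total_counts(preds, targets):
    """Accumulate (tp, fp, fn) over every image in the dataset."""
    tp = fp = fn = 0
    for pred, target in zip(preds, targets):
        a, b, c = _counts(pred, target)
        tp, fp, fn = tp + a, fp + b, fn + c
    return tp, fp, fn


def iou(preds, targets):
    """Dataset-level IoU: sum the counts over all images, then divide once."""
    tp, fp, fn = _total_counts(preds, targets)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def niou(preds, targets):
    """Normalized IoU: compute IoU per image, then average over images."""
    scores = []
    for pred, target in zip(preds, targets):
        tp, fp, fn = _counts(pred, target)
        denom = tp + fp + fn
        # Assumed convention: an image with no target and no prediction scores 0.
        scores.append(tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def f1(preds, targets):
    """Pixel-level F1 (assumed), computed from the accumulated counts."""
    tp, fp, fn = _total_counts(preds, targets)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two toy images: a 1-pixel target with one false alarm, and a 3-pixel target hit exactly.
preds = [[1, 1, 0, 0], [1, 1, 1, 0]]
targets = [[1, 0, 0, 0], [1, 1, 1, 0]]
print(iou(preds, targets))            # 0.8
print(niou(preds, targets))           # 0.75
print(round(f1(preds, targets), 3))   # 0.889
```

In this toy case nIoU is lower than IoU because the image with the smaller target counts as much as the image with the larger one, which is why the two metrics are usually reported together for small-target benchmarks.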

Paper Structure

This paper contains 19 sections, 19 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Challenges in infrared small target detection. (a) Targets are too small, making them prone to being missed during detection. (b) Targets have low distinguishability from the background.
  • Figure 2: Overview of the proposed SAMamba framework. The architecture consists of three key components: Feature Selection Adapter (FS-Adapter) for domain-specific feature extraction, Cross-Channel State-Space Interaction (CSI) for long-range dependency modeling, and Detail-Preserving Contextual Fusion (DPCF) for multi-scale feature aggregation.
  • Figure 3: Architecture of the Cross-Channel State-Space Interaction (CSI) module. The module integrates Vision Mamba blocks for efficient sequence modeling with cross-channel feature recombination and dual-attention refinement for enhanced target-background discrimination.
  • Figure 4: Illustration of the Detail-Preserving Contextual Fusion (DPCF) module. Low-res features ($\mathbf{F}_l$) are upsampled (interpolated) to match high-res features ($\mathbf{F}_h$). Both are split channel-wise into four segments. A learnable parameter $\alpha$ generates spatial gating weights $\beta$ via sigmoid. These gates control the weighted sum of corresponding high-res ($\mathbf{h}_i$) and low-res ($\mathbf{l}_i$) segments. The fused segments ($\mathbf{o}'_i$) are concatenated and refined by a final convolution block.
  • Figure 5: Visual results of representative methods. Pink and green circles mark true-positive and false-positive objects, respectively. Objects inside the pink rectangles are enlarged for a clearer comparison of detection accuracy across methods.
  • ...and 2 more figures
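
The data flow described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a rough illustration, not the authors' implementation. The caption does not specify several details, so the following are assumptions: nearest-neighbour upsampling stands in for the interpolation, `alpha` holds one spatial map per segment, the low-resolution segment is weighted by `1 - beta`, and a single channel-mixing matrix `w_refine` stands in for the final convolution block. The function name `dpcf_sketch` is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dpcf_sketch(f_h, f_l, alpha, w_refine):
    """Gated fusion of high-res and low-res features, following the Figure 4 caption.

    f_h:      (C, H, W) high-resolution features
    f_l:      (C, H/s, W/s) low-resolution features, s an integer scale
    alpha:    (4, H, W) learnable gate logits, one map per segment (assumed shape)
    w_refine: (C, C) channel-mixing matrix, a stand-in for the final conv block
    """
    scale = f_h.shape[1] // f_l.shape[1]
    # Upsample F_l to the resolution of F_h (nearest-neighbour, an assumption).
    f_l_up = f_l.repeat(scale, axis=1).repeat(scale, axis=2)

    # Split both feature maps channel-wise into four segments h_i and l_i.
    h_segs = np.split(f_h, 4, axis=0)
    l_segs = np.split(f_l_up, 4, axis=0)

    # beta = sigmoid(alpha) gives spatial gating weights in (0, 1).
    beta = sigmoid(alpha)

    # Weighted sum per segment; the (1 - beta) complement is an assumption.
    fused = [
        beta[i] * h_i + (1.0 - beta[i]) * l_i
        for i, (h_i, l_i) in enumerate(zip(h_segs, l_segs))
    ]

    # Concatenate the fused segments and apply the stand-in refinement.
    out = np.concatenate(fused, axis=0)
    return np.einsum("oc,chw->ohw", w_refine, out)


# Usage: with alpha = 0 every gate is 0.5, so both inputs contribute equally.
f_h = np.ones((8, 4, 4))
f_l = np.zeros((8, 2, 2))
out = dpcf_sketch(f_h, f_l, np.zeros((4, 4, 4)), np.eye(8))
print(out.shape)  # (8, 4, 4)
```

Because the gate is spatial, the fusion can lean on the high-resolution branch at pixels where fine detail matters and on the low-resolution branch elsewhere, which matches the detail-preserving role the caption describes.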