LSP-ST: Ladder Shape-Biased Side-Tuning for Robust Infrared Small Target Detection

Guoyi Zhang, Siyang Chen, Guangsheng Xu, Han Wang, Donghe Wang, Xiaohu Zhang

TL;DR

This work tackles the challenge of adapting vision foundation models such as SAM to infrared small target detection, identifying a texture bias that hinders shape-aware localization. It introduces Ladder Shape-Biased Side-Tuning (LSP-ST), a dual-branch framework that freezes the SAM2 backbone while training a shape-biased HDConv side branch, connected via memory-efficient unidirectional links and complemented by a UNet-style decoder. Central to the method is Shape-Enhanced Large-Kernel Attention (SELKA), which uses multi-branch SConv operations to form a Gaussian-like effective receptive field (ERF) that captures both internal target distributions and boundaries without explicit edge priors, supported by a theoretical matched-filter analysis and PSF-aligned receptive-field arguments. Empirically, LSP-ST achieves state-of-the-art results on infrared small target benchmarks with only 4.72M learnable parameters. It also generalizes well to texture-insensitive tasks while maintaining performance on texture-driven ones, which supports its use for scalable, shape-aware adaptation of foundation models.

Abstract

Fine-tuning the Segment Anything Model (SAM) for infrared small target detection poses significant challenges due to severe domain shifts. Existing adaptation methods often incorporate handcrafted priors to bridge this gap, yet such designs limit generalization and scalability. We identify a fundamental texture bias in foundation models, which overly depend on local texture cues for target localization. To address this, we propose Ladder Shape-Biased Side-Tuning (LSP-ST), a novel approach that introduces a shape-aware inductive bias to facilitate effective adaptation beyond texture cues. In contrast to prior work that injects explicit edge or contour features, LSP-ST models shape as a global structural prior, integrating both boundaries and internal layouts. We design a Shape-Enhanced Large-Kernel Attention Module to hierarchically and implicitly capture structural information in a fully differentiable manner, without task-specific handcrafted guidance. A theoretical analysis grounded in matched filtering and backpropagation reveals the mechanism by which the proposed attention improves structure-aware learning. With only 4.72M learnable parameters, LSP-ST achieves state-of-the-art performance on multiple infrared small target detection benchmarks. Furthermore, its strong generalization is validated across tasks such as mirror detection, shadow detection, and camouflaged object detection, while maintaining stable performance on texture-driven tasks like salient object detection, demonstrating that the introduced shape bias complements rather than competes with texture-based reasoning.
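The matched-filtering idea mentioned in the abstract can be illustrated with a classical, non-learned example. The sketch below is not the paper's SELKA module. It assumes a small target that looks like a Gaussian blob (a common approximation of a point spread function) embedded in white noise, and correlates the scene with a Gaussian template of the same width. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian template scaled to unit L2 norm."""
    half = size // 2
    k = [[math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    norm = math.sqrt(sum(v * v for row in k for v in row))
    return [[v / norm for v in row] for row in k]


def correlate(image, kernel):
    """Valid-mode 2D cross-correlation (kernel top-left aligned at output index)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]


def make_scene(size, center, sigma, amplitude, noise_std, seed):
    """Synthetic scene: one Gaussian-shaped target plus white Gaussian noise."""
    rng = random.Random(seed)
    cy, cx = center
    return [[amplitude * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             + rng.gauss(0.0, noise_std)
             for x in range(size)] for y in range(size)]


def argmax2d(grid):
    """Return the (row, col) index of the largest value in a 2D list."""
    _, i, j = max((v, i, j) for i, row in enumerate(grid) for j, v in enumerate(row))
    return i, j


# Assumed toy settings: 32x32 scene, target centered at row 20, column 11.
scene = make_scene(size=32, center=(20, 11), sigma=1.5,
                   amplitude=4.0, noise_std=0.5, seed=0)
kernel = gaussian_kernel(size=7, sigma=1.5)
response = correlate(scene, kernel)

# Shift by the kernel half-width to map the response peak back to scene coordinates.
peak_i, peak_j = argmax2d(response)
detected = (peak_i + 3, peak_j + 3)
print("detected target near:", detected)
```

Because the template matches the target's spatial profile, the correlation sums evidence over the whole blob rather than relying on any single pixel. This is the intuition behind preferring a Gaussian-like receptive field for point-like infrared targets.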

Paper Structure

This paper contains 30 sections, 28 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The overall architecture of the proposed Ladder Shape-Biased Side-Tuning (LSP-ST) framework. A dual-branch structure is built on top of the SAM2 encoder, which is frozen except for its normalization layers; the side-tuning branch introduces shape-biased representations to complement the texture-biased features from the backbone. Unidirectional connections and channel reduction are employed to achieve memory- and parameter-efficient training yin2024parameter. Following the baseline SAM2-UNet xiong2024sam2, we replace the original SAM2 decoder with a UNet-style decoder archit2025segment for dense prediction.
  • Figure 2: Radar chart built from partial results on six downstream tasks, with SAM2-UNet xiong2024sam2 as the baseline. "Previous SOTA" denotes the best-performing algorithm for each downstream task, excluding the baseline; these include CSFwinformer (TIP'24), IRSAM (ECCV'24), SAM-SPL (TGRS'25), SAIST (CVPR'25), Dual-SAM (CVPR'24), Rmlanet (TCSVT'23), HGINet (TIP'24), UGDNet (TMM'25), AdaptCOD (IJCV'25), CamoDiffusion (TPAMI'25), and MDSAM (MM'24). The proposed method consistently achieves superior overall performance and outperforms the strong baseline across tasks. In particular, conventional fine-tuning (e.g., Mona (CVPR'25) yin20255) tends to degrade performance, whereas our approach mitigates this issue. It delivers substantial improvements on texture-insensitive tasks while maintaining competitive results on texture-dependent ones, demonstrating strong generalization.
  • Figure 3: Comparison of visual recognition results under complex scene texture interference. The proposed method is more robust in these scenarios.
  • Figure 4: In complex scenes, intra-object distributions capture target characteristics better than boundary cues. Together, the two define the overall shape of the object.
  • Figure 5: Detailed architecture of the proposed LSP-ST. To reduce the number of learnable parameters, at each stage we first reduce feature dimensionality via the RFB module xiong2024sam2 and then use only a single instance of the proposed HDConv Block.
  • ...and 5 more figures
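
The dual-branch layout described in the Figure 1 caption (a frozen backbone feeding a small trainable side branch through one-way connections) can be sketched with a scalar toy model. This is only an illustration under strong simplifying assumptions: the "backbone" is a chain of fixed tanh units, the "side branch" is a linear readout of the per-stage features, and channel reduction, normalization-layer tuning, and the decoder are omitted. The class names, weights, and target function are hypothetical and are not taken from the paper.

```python
import math
import random


class FrozenBackbone:
    """Stand-in for a frozen pretrained encoder; its weights are never updated."""

    def __init__(self, weights):
        self.weights = tuple(weights)  # immutable on purpose

    def forward(self, x):
        feats, h = [], x
        for w in self.weights:
            h = math.tanh(w * h)
            feats.append(h)  # one intermediate feature per stage
        return feats


class SideBranch:
    """Trainable ladder readout: accumulates a_i * f_i over backbone stages."""

    def __init__(self, n_stages):
        self.a = [0.0] * n_stages

    def forward(self, feats):
        s = 0.0
        for a_i, f_i in zip(self.a, feats):
            s += a_i * f_i  # information flows backbone -> side only
        return s

    def sgd_step(self, feats, target, lr):
        # Backbone features are treated as constants, so the gradient of the
        # squared error w.r.t. a_i is 2 * err * f_i; no backbone gradient is needed.
        err = self.forward(feats) - target
        self.a = [a_i - lr * 2.0 * err * f_i for a_i, f_i in zip(self.a, feats)]


def toy_target(feats):
    """Hypothetical regression target built from the first and last stage features."""
    return 1.0 * feats[0] + 0.5 * feats[-1]


def mean_loss(backbone, side, xs):
    total = 0.0
    for x in xs:
        feats = backbone.forward(x)
        total += (side.forward(feats) - toy_target(feats)) ** 2
    return total / len(xs)


rng = random.Random(0)
backbone = FrozenBackbone([1.5, 1.2, 0.9])
side = SideBranch(n_stages=3)
eval_xs = [rng.uniform(-2.0, 2.0) for _ in range(200)]

initial_loss = mean_loss(backbone, side, eval_xs)
for _ in range(2000):
    feats = backbone.forward(rng.uniform(-2.0, 2.0))
    side.sgd_step(feats, toy_target(feats), lr=0.1)
final_loss = mean_loss(backbone, side, eval_xs)

print("loss before:", initial_loss, "after:", final_loss)
print("backbone weights unchanged:", backbone.weights)
```

Only the side-branch coefficients change during training, and the update uses nothing but the forward features of the backbone. In a real network this is what makes one-way connections memory-friendly: no gradients have to be propagated through, or stored for, the frozen encoder.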