LSP-ST: Ladder Shape-Biased Side-Tuning for Robust Infrared Small Target Detection
Guoyi Zhang, Siyang Chen, Guangsheng Xu, Han Wang, Donghe Wang, Xiaohu Zhang
TL;DR
This work adapts vision foundation models such as SAM to infrared small target detection. It identifies a texture bias that hinders localization: these models rely too heavily on local texture cues. To counter this, it introduces Ladder Shape-Biased Side-Tuning (LSP-ST), a dual-branch framework that freezes the SAM2 backbone and trains a shape-biased HDConv side branch, connected via memory-efficient unidirectional links and complemented by a UNet-style decoder. Central to the method is Shape-Enhanced Large-kernel Attention (SELKA), which uses multi-branch SConv operations to form a Gaussian-like effective receptive field (ERF) that captures both internal target distributions and boundaries without explicit edge priors. The design is supported by a theoretical matched-filter analysis and by receptive-field arguments based on alignment with the point spread function (PSF). Empirically, LSP-ST achieves state-of-the-art results on infrared small target benchmarks with only 4.72M learnable parameters. It also generalizes well to texture-insensitive tasks while maintaining performance on texture-driven tasks, which supports its use for scalable, shape-aware adaptation of foundation models.
Abstract
Fine-tuning the Segment Anything Model (SAM) for infrared small target detection poses significant challenges due to severe domain shifts. Existing adaptation methods often incorporate handcrafted priors to bridge this gap, yet such designs limit generalization and scalability. We identify a fundamental texture bias in foundation models, which overly depend on local texture cues for target localization. To address this, we propose Ladder Shape-Biased Side-Tuning (LSP-ST), a novel approach that introduces a shape-aware inductive bias to facilitate effective adaptation beyond texture cues. In contrast to prior work that injects explicit edge or contour features, LSP-ST models shape as a global structural prior, integrating both boundaries and internal layouts. We design a Shape-Enhanced Large-Kernel Attention Module to hierarchically and implicitly capture structural information in a fully differentiable manner, without task-specific handcrafted guidance. A theoretical analysis grounded in matched filtering and backpropagation reveals the mechanism by which the proposed attention improves structure-aware learning. With only 4.72M learnable parameters, LSP-ST achieves state-of-the-art performance on multiple infrared small target detection benchmarks. Furthermore, its strong generalization is validated across tasks such as mirror detection, shadow detection, and camouflaged object detection, while maintaining stable performance on texture-driven tasks like salient object detection, demonstrating that the introduced shape bias complements rather than competes with texture-based reasoning.
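The matched-filtering intuition mentioned above can be illustrated with a classical, non-learned example. The sketch below is not the paper's SELKA module; it only shows why a Gaussian-like response profile suits a small target whose appearance is dominated by the sensor's point spread function. It assumes an isotropic Gaussian target profile and white Gaussian noise, and every function name and parameter value is hypothetical.

```python
import math
import random


def gaussian_kernel(size, sigma):
    """Zero-mean, unit-norm 2D Gaussian template (assumed PSF shape)."""
    c = size // 2
    k = [[math.exp(-((y - c) ** 2 + (x - c) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    mean = sum(map(sum, k)) / (size * size)
    # Subtract the mean so a flat background produces zero response.
    k = [[v - mean for v in row] for row in k]
    norm = math.sqrt(sum(v * v for row in k for v in row))
    return [[v / norm for v in row] for row in k]


def correlate_valid(img, k):
    """Cross-correlate the template with the image over the 'valid' region."""
    h, w, s = len(img), len(img[0]), len(k)
    return [[sum(img[y + j][x + i] * k[j][i]
                 for j in range(s) for i in range(s))
             for x in range(w - s + 1)]
            for y in range(h - s + 1)]


def make_scene(h, w, target_yx, sigma, amp, noise_std, seed=0):
    """Synthetic frame: flat background + white noise + one Gaussian blob."""
    rng = random.Random(seed)
    ty, tx = target_yx
    return [[10.0 + rng.gauss(0.0, noise_std)
             + amp * math.exp(-((y - ty) ** 2 + (x - tx) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]


def detect(img, k):
    """Return the (row, col) of the strongest matched-filter response."""
    resp = correlate_valid(img, k)
    half = len(k) // 2
    _, y, x = max((v, y, x) for y, row in enumerate(resp)
                  for x, v in enumerate(row))
    # Shift from 'valid' output coordinates back to image coordinates.
    return y + half, x + half


# Usage: a single dim blob in a noisy 32x32 frame (illustrative values).
template = gaussian_kernel(7, 1.2)
frame = make_scene(32, 32, (12, 20), sigma=1.2, amp=5.0, noise_std=0.3, seed=1)
print(detect(frame, template))
```

In this toy setting the template is fixed by hand. The abstract's claim is that the proposed attention acquires a comparable structure-aware response through training, without such handcrafted guidance.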
