Beyond Global Scanning: Adaptive Visual State Space Modeling for Salient Object Detection in Optical Remote Sensing Images
Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong
TL;DR
This work tackles salient object detection in optical remote sensing images, where low contrast and scale variation are the main challenges. It introduces ASCNet, a visual state space–based encoder–decoder that combines a Multi-Level Context Module (MLCM) for cross-scale, topology-aware fusion with an Adaptive Patchwise Visual State Space (APVSS) decoder. The decoder employs a Granularity-aware Propagation Module (GPM) and Dynamic Adaptive Granularity Scan (DAGS) to balance global context with fine-grained local details. The approach yields state-of-the-art results on ORSSD and EORSSD, supported by ablations and scanning-strategy analyses that confirm the complementary benefits of MLCM, GPM, and DAGS. The proposed framework advances robust, boundary-accurate saliency detection in complex remote sensing scenes and offers a scalable, geometry-conscious modeling paradigm for ORSI-SOD tasks.
Abstract
Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including large variations in target scale and low contrast between targets and the background. Existing methods based on vision transformer (ViT) and convolutional neural network (CNN) architectures aim to leverage both global and local features, but the difficulty of integrating these heterogeneous features effectively limits their overall performance. To overcome these limitations, we propose an adaptive state space context network (ASCNet), which builds upon the state space model mechanism to capture long-range dependencies and enhance regional feature representation simultaneously. Specifically, we employ a visual state space encoder to extract multi-scale features. To guide and enhance these features further, we first design a Multi-Level Context Module (MLCM). This module strengthens cross-layer interaction between features of different scales and enhances the model's structural perception, allowing it to distinguish foreground from background more effectively. We then design the Adaptive Patchwise Visual State Space (APVSS) block as the decoder of ASCNet. This block integrates our proposed Dynamic Adaptive Granularity Scan (DAGS) and Granularity-aware Propagation Module (GPM). It performs adaptive patch scanning on feature maps enhanced by local perception, thereby capturing rich local region information and strengthening the state space model's local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness.
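To make the contrast between global scanning and patch scanning concrete, the following is a minimal NumPy sketch of how a feature map can be flattened into a token sequence in two different orders. It is an illustration under stated assumptions, not the DAGS procedure itself: the function names (`global_scan_order`, `patchwise_scan_order`, `scan`, `unscan`) are hypothetical, and the patch size is a fixed argument here, whereas the abstract describes the scan granularity as adaptive without specifying the selection rule.

```python
import numpy as np


def global_scan_order(h, w):
    """Row-major scan over the whole h x w grid (the usual global scan)."""
    return np.arange(h * w)


def patchwise_scan_order(h, w, patch):
    """Visit the grid patch by patch, row-major inside each patch.

    Returns flat indices into the row-major h x w grid.
    Assumption: `patch` is fixed and divides both h and w.
    """
    if h % patch or w % patch:
        raise ValueError("patch must divide both h and w")
    idx = np.arange(h * w).reshape(h, w)
    # (h/p, p, w/p, p) -> (h/p, w/p, p, p): group the pixels of each patch
    blocks = idx.reshape(h // patch, patch, w // patch, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(-1)


def scan(features, order):
    """Reorder a (C, H, W) feature map into an (H*W, C) token sequence."""
    c, h, w = features.shape
    return features.reshape(c, h * w).T[order]


def unscan(tokens, order, h, w):
    """Invert `scan`: map an (H*W, C) sequence back to a (C, H, W) map."""
    inverse = np.empty_like(order)
    inverse[order] = np.arange(order.size)
    return tokens[inverse].T.reshape(-1, h, w)


if __name__ == "__main__":
    feats = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
    order = patchwise_scan_order(4, 4, patch=2)
    tokens = scan(feats, order)  # sequence a state space layer would consume
    restored = unscan(tokens, order, 4, 4)
    print(order.tolist())
    print(np.array_equal(restored, feats))
```

In the patchwise order, pixels of the same local region are adjacent in the sequence, so a sequential state space update sees them close together; with the patch size equal to the full map, the order reduces to the global row-major scan.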
