Beyond Global Scanning: Adaptive Visual State Space Modeling for Salient Object Detection in Optical Remote Sensing Images

Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong

TL;DR

This work tackles salient object detection in optical remote sensing images, where low target-background contrast and large scale variation make the task difficult. It introduces ASCNet, a visual state space–based encoder–decoder that combines a Multi-Level Context Module (MLCM) for cross-scale, topology-aware fusion with an Adaptive Patchwise Visual State Space (APVSS) decoder. The decoder employs a Granularity-aware Propagation Module (GPM) and a Dynamic Adaptive Granularity Scan (DAGS) to balance global context with fine-grained local details. The approach yields state-of-the-art results on ORSSD and EORSSD, supported by ablations and scanning-strategy analyses that confirm the complementary benefits of MLCM, GPM, and DAGS. The proposed framework advances robust, boundary-accurate saliency detection in complex remote sensing scenes and offers a scalable, geometry-conscious modeling paradigm for ORSI-SOD tasks.

Abstract

Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scale and low contrast between targets and the background. Existing methods based on vision transformer (ViT) and convolutional neural network (CNN) architectures aim to leverage both global and local features, but the difficulty of effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose an adaptive state space context network (ASCNet), which builds upon the state space model mechanism to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ a visual state space encoder to extract multi-scale features. To further guide and enhance these features, we first design a Multi-Level Context Module (MLCM). This module strengthens cross-layer interaction between features of different scales while enhancing the model's structural perception, allowing it to distinguish foreground from background more effectively. Then, we design the APVSS block as the decoder of ASCNet. This block integrates our proposed Dynamic Adaptive Granularity Scan (DAGS) and Granularity-aware Propagation Module (GPM). It performs adaptive patch scanning on feature maps enhanced by local perception, thereby capturing rich local region information and enhancing the state space model's local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness.
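To make the contrast between global scanning and patch scanning concrete, the following is a minimal NumPy sketch of the two token orderings. It is an illustration only, not the DAGS implementation: the helper names (`global_scan_order`, `patch_scan_order`, `scan`, `unscan`) are hypothetical, and the patch size is fixed by the caller, whereas DAGS is described as choosing the granularity adaptively by a rule not given in this section.

```python
import numpy as np


def global_scan_order(h, w):
    """Row-major (raster) order over an h x w grid, i.e. a plain global scan."""
    return np.arange(h * w)


def patch_scan_order(h, w, patch):
    """Visit the grid patch by patch, in raster order inside each patch.

    Hypothetical helper for illustration; `patch` must divide both h and w.
    """
    if h % patch or w % patch:
        raise ValueError("patch must divide both h and w")
    idx = np.arange(h * w).reshape(h, w)
    # Axes after reshape: (patch row, row in patch, patch col, col in patch).
    blocks = idx.reshape(h // patch, patch, w // patch, patch)
    # Bring the two patch axes first so each patch is emitted contiguously.
    return blocks.transpose(0, 2, 1, 3).reshape(-1)


def scan(features, order):
    """Flatten a (C, H, W) feature map into an (H*W, C) token sequence."""
    c, h, w = features.shape
    return features.reshape(c, h * w).T[order]


def unscan(tokens, order, h, w):
    """Inverse of `scan`: place tokens back on the (C, H, W) grid."""
    out = np.empty_like(tokens)
    out[order] = tokens
    return out.T.reshape(-1, h, w)


if __name__ == "__main__":
    print(global_scan_order(4, 4))
    print(patch_scan_order(4, 4, 2))
```

With a 2 x 2 patch on a 4 x 4 grid, the four pixels of each patch become consecutive tokens, so a sequence model sees spatial neighbours close together in the sequence. Setting the patch size equal to the grid size recovers the global raster order.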

Paper Structure

This paper contains 17 sections, 16 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overall performance comparison on the EORSSD (left) and ORSSD (right) datasets. Our method consistently outperforms recent SOTA approaches across multiple evaluation metrics on both datasets.
  • Figure 2: Qualitative comparison results on challenging ORSI scenes. From left to right are the RGB images, ground truth (GT), results produced by our method, and those generated by representative state-of-the-art (SOTA) methods. Our approach yields more accurate object localization and boundary delineation, while reducing misdetection and boundary errors.
  • Figure 3: Pipeline of the proposed method. A: ASCNet adopts an encoder-decoder architecture. The encoder is built upon the state space block. In the skip connections, an MLCM is employed to enhance the model’s representation capability through feature fusion and a topology-aware attention mechanism. Subsequently, in the decoding stage, the features processed by the MLCM (D) are fed into the APVSS blocks for decoding. B: The proposed APVSS blocks enhance the model’s ability to capture local information through a GPM (E) and a DAGS (C), aiming for more precise boundary representation.
  • Figure 4: Quantitative comparisons between our method and other ORSI-SOD approaches on the ORSSD [LVNet_2019] and EORSSD [DFANet_2020] datasets.
  • Figure 5: Qualitative comparison between our method and seven other SOTA approaches across various challenging scenarios.
  • ...and 2 more figures