UrbanSAM: Learning Invariance-Inspired Adapters for Segment Anything Models in Urban Construction

Chenyu Li, Danfeng Hong, Bing Zhang, Yuxuan Li, Gustau Camps-Valls, Xiao Xiang Zhu, Jocelyn Chanussot

TL;DR

UrbanSAM integrates invariance-inspired learnable adapters with the Segment Anything Model (SAM) to address scale variation and morphological heterogeneity in global urban remote sensing segmentation. Grounded in multi-resolution analysis, it learns scale-invariant features via U-Scaling adapters and transfers this knowledge to the encoder through cross-alignment, enabling learnable prompts without manual intervention. The approach combines a four-module U-Scaling adapter, cross-branch masked attention, a hierarchical consistency decoder, and LoRA-based fine-tuning, achieving state-of-the-art results on water, road, and building tasks with notable parameter efficiency. The work offers significant potential for scalable urban area mapping and change detection across diverse geographies and data sources.
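
To make the parameter-efficiency point concrete, the following is a minimal sketch of a generic LoRA update, not UrbanSAM's actual configuration. The frozen weight `W` is left untouched and only two small matrices `A` (rank x d_in) and `B` (d_out x rank) are trained. The function names, the rank, the scaling factor `alpha`, and the 768 x 768 layer size are illustrative assumptions that do not come from the paper.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x), the standard LoRA forward pass.

    W: frozen d_out x d_in weight; A: r x d_in; B: d_out x r (both trainable).
    All shapes and the value of alpha are hypothetical, chosen for illustration.
    """
    rank = len(A)
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank trainable path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning parameters, LoRA parameters) for one layer."""
    return d_out * d_in, rank * (d_in + d_out)


# With B initialised to zero, the adapted layer starts out identical to the frozen one.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]
y_init = lora_forward(W, A, B_zero, [1.0, 2.0], alpha=2.0)

# Hypothetical 768 x 768 projection with rank 4.
full_params, lora_params = lora_param_counts(768, 768, 4)
```

For this hypothetical layer, the low-rank path trains 6,144 parameters instead of 589,824, roughly 1% of the full matrix, which is the kind of saving that motivates LoRA-based fine-tuning.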

Abstract

Object extraction and segmentation from remote sensing (RS) images is a critical yet challenging task in urban environment monitoring. Urban morphology is inherently complex, with irregular objects of diverse shapes and varying scales. These challenges are amplified by heterogeneity and scale disparities across RS data sources, including sensors, platforms, and modalities, making accurate object segmentation particularly demanding. While the Segment Anything Model (SAM) has shown significant potential in segmenting complex scenes, its performance in handling form-varying objects remains limited due to manual-interactive prompting. To this end, we propose UrbanSAM, a customized version of SAM specifically designed to analyze complex urban environments while tackling scaling effects from remotely sensed observations. Inspired by multi-resolution analysis (MRA) theory, UrbanSAM incorporates a novel learnable prompter equipped with a Uscaling-Adapter that adheres to the invariance criterion, enabling the model to capture multiscale contextual information of objects and adapt to arbitrary scale variations with theoretical guarantees. Furthermore, features from the Uscaling-Adapter and the trunk encoder are aligned through a masked cross-attention operation, allowing the trunk encoder to inherit the adapter's multiscale aggregation capability. This synergy enhances the segmentation performance, resulting in more powerful and accurate outputs, supported by the learned adapter. Extensive experimental results demonstrate the flexibility and superior segmentation performance of the proposed UrbanSAM on a global-scale dataset, encompassing scale-varying urban objects such as buildings, roads, and water.
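
The masked cross-attention mentioned above can be illustrated with a small, dependency-free sketch. It is a generic scaled dot-product formulation under stated assumptions, not the authors' implementation: trunk-encoder tokens are assumed to act as queries, adapter features as keys and values, and the boolean `mask` (True means "may attend") is a hypothetical input whose construction this page does not describe.

```python
import math


def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product cross-attention with a boolean mask.

    queries: n_q x d   (assumed: trunk-encoder tokens)
    keys:    n_k x d   (assumed: adapter features)
    values:  n_k x d_v (assumed: adapter features)
    mask:    n_q x n_k booleans, True where a query may attend to a key
    """
    d = len(queries[0])
    d_v = len(values[0])
    outputs = []
    for q, row_mask in zip(queries, mask):
        # Masked positions get a score of -inf so their softmax weight is zero.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            if allowed else float("-inf")
            for k, allowed in zip(keys, row_mask)
        ]
        top = max(scores)
        if top == float("-inf"):
            # Every key is masked for this query: return a zero vector.
            outputs.append([0.0] * d_v)
            continue
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total
             for j in range(d_v)]
        )
    return outputs


# One query that is only allowed to see the first of two adapter features.
attended = masked_cross_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
    mask=[[True, False]],
)
```

Because the second key is masked out, the query's output is exactly the first value vector; with an all-True mask the same call reduces to ordinary cross-attention.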

Paper Structure

This paper contains 29 sections, 7 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: UrbanSAM achieves robust handling of scale variations and superior segmentation performance owing to its invariance-aware adapters inspired by MRA theory. This design effectively addresses scale effects, enhancing adaptability and accuracy in urban segmentation tasks. Notably, manual prompts (point or box) often produce incomplete, low-activation attention maps with high errors and obvious noise, whereas UrbanSAM's learned invariance-aware adapters precisely localize the relevant elements or objects by accumulating and integrating attention across different scales.
  • Figure 2: An illustrative workflow of the proposed UrbanSAM, which can segment dominant urban elements (e.g., buildings, water, roads) by leveraging the invariance attribute from MRA theory. (a) Invariance is embedded into the prompter design through cascaded U-scaling adapters (Fig. 3) that capture hidden cues across multiple scales. (b) These learned multiscale features in the prompt stream are transferred to the SAM main body (i.e., transformer blocks), guiding the alignment of subsequent representations and leading to more robust, superior segmentation performance in urban scenes.
  • Figure 3: An illustrative workflow of the proposed U-Scaling adapters, which follows MRA theory and aims to provide adaptive and effective prompt guidance for SAM by learning scene intrinsic scale invariance across multiple resolutions, where each adapter is designed to approximate different forms of optimal receptive fields at different resolutions.
  • Figure 4: Global distribution of sample datasets for urban construction (building, water, and road), marked by distinct shapes. Paired examples (RS images and corresponding labels) illustrate the content of each dataset.
  • Figure 5: Visualization of water body extraction results using various methods compared to our UrbanSAM, where red represents false positives and yellow indicates false negatives. (A)–(C) present large-scale qualitative outcomes from UrbanSAM across three regions. (a)–(o) illustrate water body extraction in selected ROIs using FCN8s, UNet, LinkNet50, PSPNet, DeepLabv3+, HRNetv2, MECNet, Segformer, Uformer, MFSegformer, SAM, SAMDB, HQSAMDB, and UrbanSAM, respectively.
  • ...and 6 more figures