Zero-Shot Refinement of Buildings' Segmentation Models using SAM

Ali Mayladan, Hasan Nasrallah, Hasan Moughnieh, Mustafa Shukor, Ali J. Ghandour

TL;DR

This work addresses building footprint segmentation in remote sensing, where CNN generalization suffers under distribution shift. It proposes zero-shot refinement of CNN predictions with the Segment Anything Model (SAM): a CNN-based prompt generator converts CNN-derived masks into SAM prompts using single-point, multiple-point, bounding-box, and hybrid strategies, optionally with negative points. On the WHU Buildings dataset, the approach improves out-of-distribution IoU by up to 5.47 percentage points and F1 by up to 4.81 points, and improves in-distribution True-Positive IoU and True-Positive F1 by 2.72% and 1.58% respectively, with bounding-box prompts often delivering the strongest results. These findings suggest that injecting domain-specific priors into SAM through prompt engineering can improve instance-level segmentation in remote sensing, and they motivate broader use of foundation-model prompting in geospatial tasks.
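The mask-to-prompt conversion described above can be illustrated with a minimal sketch. It is not the authors' prompt generator: the function name `mask_to_prompts`, the choice of the foreground pixel nearest the centroid as the single point, and the uniform random sampling for multiple points are all assumptions made for illustration. The output shapes follow the convention commonly used by SAM predictors (points as `(x, y)` pairs, boxes in XYXY order).

```python
import numpy as np


def mask_to_prompts(mask, num_points=3, seed=0):
    """Derive point and box prompts from one binary instance mask (H x W).

    Hypothetical helper for illustration; the paper's generator may differ.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no foreground pixels")

    # Bounding-box prompt in XYXY pixel coordinates.
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])

    # Single-point prompt: the foreground pixel closest to the centroid,
    # so the point is guaranteed to lie on the mask even for concave shapes.
    cy, cx = ys.mean(), xs.mean()
    nearest = np.argmin((ys - cy) ** 2 + (xs - cx) ** 2)
    single_point = np.array([[xs[nearest], ys[nearest]]])

    # Multiple-point prompt: foreground pixels sampled without replacement
    # (an assumed stand-in for the random multiple-points strategy).
    rng = np.random.default_rng(seed)
    idx = rng.choice(ys.size, size=min(num_points, ys.size), replace=False)
    multi_points = np.stack([xs[idx], ys[idx]], axis=1)

    return {"box": box, "single_point": single_point, "multi_points": multi_points}


# Toy example: a 3 x 4 rectangular "building" inside an 8 x 8 tile.
toy_mask = np.zeros((8, 8), dtype=np.uint8)
toy_mask[2:5, 3:7] = 1
prompts = mask_to_prompts(toy_mask)
```

In a full pipeline, each instance mask predicted by the CNN would be passed through such a function, and the resulting prompts, with all points labeled as foreground, would be handed to SAM one building at a time.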

Abstract

Foundation models have excelled in various tasks but are often evaluated on general benchmarks. The adaptation of these models to specific domains, such as remote sensing imagery, remains underexplored. In remote sensing, precise building instance segmentation is vital for applications like urban planning. While Convolutional Neural Networks (CNNs) perform well, their generalization can be limited. To this end, we present a novel approach that adapts foundation models to address the generalization drop of existing models. Among several models, we focus on the Segment Anything Model (SAM), a potent foundation model renowned for its class-agnostic image segmentation capabilities. We start by identifying the limitations of SAM, revealing its suboptimal performance when applied to remote sensing imagery. Moreover, SAM does not offer recognition abilities and thus fails to classify and tag localized objects. To address these limitations, we introduce different prompting strategies, including integrating a pre-trained CNN as a prompt generator. This novel approach augments SAM with recognition abilities, a first of its kind. We evaluated our method on three remote sensing datasets: the WHU Buildings dataset, the Massachusetts Buildings dataset, and the AICrowd Mapping Challenge. For out-of-distribution performance on the WHU dataset, we achieve a 5.47% increase in IoU and a 4.81% improvement in F1-score. For in-distribution performance on the WHU dataset, we observe a 2.72% and 1.58% increase in True-Positive-IoU and True-Positive-F1 score, respectively. Our code is publicly available at https://github.com/geoaigroup/GEOAI-ECRS2023, and we hope it inspires further exploration of foundation models for domain-specific tasks within the remote sensing community.
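For readers unfamiliar with the two headline metrics, the sketch below computes pixel-level IoU and F1 for a pair of binary masks using their standard definitions. The function name `iou_and_f1` and the convention of returning 1.0 when both masks are empty are assumptions for illustration. The True-Positive variants mentioned in the abstract are not defined in this excerpt and are therefore not reproduced here.

```python
import numpy as np


def iou_and_f1(pred, target):
    """Pixel-level IoU and F1 between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    tp = np.logical_and(pred, target).sum()   # predicted building, is building
    fp = np.logical_and(pred, ~target).sum()  # predicted building, is background
    fn = np.logical_and(~pred, target).sum()  # missed building pixels

    if tp + fp + fn == 0:
        # Both masks are empty; treating this as a perfect match is an assumption.
        return 1.0, 1.0

    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return float(iou), float(f1)


# One overlapping pixel, one false positive, one false negative.
iou, f1 = iou_and_f1([[1, 1, 0, 0]], [[0, 1, 1, 0]])
```

The two metrics are monotonically related for a single mask pair (F1 = 2·IoU / (1 + IoU)), which is why gains in one typically accompany gains in the other.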

Paper Structure (5 sections, 2 figures, 2 tables)

Figures (2)

  • Figure S1: An input RGB image undergoes rooftop instance segmentation via the CNN model. The segmentation masks are passed to the Prompt Generator, whose output is used to prompt SAM. This approach equips SAM with recognition abilities and yields precise building output masks.
  • Figure S2: Visualizations, over three different images from the WHU dataset, of the prompt engineering experiments: Single-point, Single-point + Negative-points (in red), Skeleton Multiple-points, Random Multiple-points, and Bounding-box.