Zero-Shot Refinement of Buildings' Segmentation Models using SAM
Ali Mayladan, Hasan Nasrallah, Hasan Moughnieh, Mustafa Shukor, Ali J. Ghandour
TL;DR
This work addresses building footprint segmentation in remote sensing, where CNN generalization suffers under distribution shift. It proposes a zero-shot refinement of CNN segmentation outputs using the Segment Anything Model (SAM): a CNN-based prompt generator converts CNN-derived masks into SAM prompts using single-point, multiple-point, bounding-box, and hybrid strategies, optionally with negative points. On the WHU Buildings dataset, the approach improves out-of-distribution IoU by up to 5.47% and F1-score by up to 4.81%, alongside gains in TP-IoU and TP-F1. In-distribution, True-Positive-IoU and True-Positive-F1 rise by 2.72% and 1.58%, respectively, and bounding-box prompts often deliver the strongest results. These findings suggest that supplying domain-specific priors to SAM through prompt engineering can improve instance-level segmentation in remote sensing, and they motivate broader use of foundation-model prompting in geospatial tasks.
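To make the prompt-generator idea concrete, the following is a minimal, dependency-free sketch of how a binary CNN mask could be turned into per-building prompts. It is not the authors' implementation: the function names, the dictionary layout, the choice of the pixel nearest the centroid as the single point, the even spacing of multiple points, and the reading of "hybrid" as box plus one point are all assumptions made for illustration. Negative points are left out.

```python
from collections import deque


def connected_components(mask):
    """Return 4-connected foreground regions, each a list of (row, col) pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def make_prompt(pixels, strategy="box", n_points=3):
    """Build one prompt for one region (hypothetical dict layout)."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    box = [min(xs), min(ys), max(xs), max(ys)]  # XYXY pixel coordinates

    # Single point (assumption): the region pixel closest to the centroid,
    # so the point always lies on the predicted building, even if concave.
    cy, cx = sum(ys) / len(ys), sum(xs) / len(xs)
    y0, x0 = min(pixels, key=lambda p: (p[0] - cy) ** 2 + (p[1] - cx) ** 2)
    center = [x0, y0]  # (x, y) order

    # Multiple points (assumption): evenly spaced picks from the pixel list.
    k = min(n_points, len(pixels))
    picks = [pixels[i * len(pixels) // k] for i in range(k)]
    spread = [[x, y] for y, x in picks]

    if strategy == "single_point":
        return {"points": [center], "labels": [1], "box": None}
    if strategy == "multi_point":
        return {"points": spread, "labels": [1] * k, "box": None}
    if strategy == "box":
        return {"points": None, "labels": None, "box": box}
    if strategy == "hybrid":  # assumption: box plus one foreground point
        return {"points": [center], "labels": [1], "box": box}
    raise ValueError(f"unknown strategy: {strategy}")


def prompts_from_mask(mask, strategy="box"):
    """One prompt per connected region of the CNN mask."""
    return [make_prompt(p, strategy) for p in connected_components(mask)]
```

Each returned prompt carries foreground points in (x, y) order with label 1, or an XYXY box, which is the general shape of input that SAM's point and box prompts take. Passing these to an actual SAM predictor, and merging the per-building masks it returns, is outside the scope of this sketch.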
Abstract
Foundation models have excelled in various tasks but are often evaluated on general benchmarks. Adapting these models to specific domains, such as remote sensing imagery, remains underexplored. In remote sensing, precise building instance segmentation is vital for applications such as urban planning. While Convolutional Neural Networks (CNNs) perform well, their generalization can be limited. To this end, we present a novel approach that adapts foundation models to compensate for the generalization drop of existing models. Among several models, we focus on the Segment Anything Model (SAM), a powerful foundation model known for class-agnostic image segmentation. We start by identifying the limitations of SAM, revealing its suboptimal performance when applied to remote sensing imagery. Moreover, SAM does not offer recognition abilities and thus cannot classify or tag the objects it localizes. To address these limitations, we introduce different prompting strategies, including integrating a pre-trained CNN as a prompt generator. This approach augments SAM with recognition abilities, a first of its kind. We evaluate our method on three remote sensing datasets: the WHU Buildings dataset, the Massachusetts Buildings dataset, and the AICrowd Mapping Challenge. For out-of-distribution performance on the WHU dataset, we achieve a 5.47% increase in IoU and a 4.81% improvement in F1-score. For in-distribution performance on the WHU dataset, we observe increases of 2.72% and 1.58% in True-Positive-IoU and True-Positive-F1 score, respectively. Our code is publicly available at https://github.com/geoaigroup/GEOAI-ECRS2023, and we hope it inspires further exploration of foundation models for domain-specific tasks within the remote sensing community.
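For readers less familiar with the reported metrics, the sketch below computes the standard pixel-level IoU and F1-score between a predicted and a reference binary mask. It uses the textbook definitions, IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN); the function names are illustrative, and the True-Positive variants (TP-IoU, TP-F1) are not covered because their definition is not given in this section.

```python
def confusion_counts(pred, target):
    """Count true positives, false positives, and false negatives per pixel."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(pred, target):
    """Intersection over union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = tp + fp + fn
    # Convention chosen here: two empty masks count as a perfect match.
    return tp / denom if denom else 1.0


def f1_score(pred, target):
    """F1 (Dice) score: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

For a single pair of masks the two metrics are tied by F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU, and a given IoU gain maps to a smaller F1 gain at high overlap.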
