Prompting DirectSAM for Semantic Contour Extraction in Remote Sensing Images
Shiyu Miao, Delong Chen, Fan Liu, Chuanyi Zhang, Yanhui Gu, Shengjie Guo, Jun Zhou
TL;DR
The paper addresses the gap in semantic contour extraction for remote sensing by extending DirectSAM with a vision-language prompter and a large-scale dataset, RemoteContour-34k. It introduces DirectSAM-RS, a multi-task, promptable foundation model that can operate in zero-shot settings and achieves state-of-the-art results when fine-tuned on road, building, and coastline benchmarks. The approach relies on a Mask2Contour transformation to repurpose existing segmentation datasets into contour annotations, and on a cross-attention prompter to condition contours on language prompts, enabling flexible, class-aware contour extraction. The findings demonstrate substantial performance gains and the potential for scalable, multi-task remote-sensing contour extraction with language conditioning, with practical impact for tasks such as mapping and infrastructure monitoring. The work also suggests expanding the data and exploring few-shot learning to further improve generalization across diverse remote-sensing environments.
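To illustrate the idea of turning a segmentation mask into a contour annotation, here is a minimal sketch in Python. It is not the paper's Mask2Contour implementation, whose details are not given in this summary. The function name mask_to_contour, the 4-neighbour boundary rule, and the treatment of the image border as background are all assumptions made for illustration.

```python
import numpy as np


def mask_to_contour(mask: np.ndarray, class_id: int) -> np.ndarray:
    """Return a binary contour map for one class of a label mask.

    Assumption: a pixel is a contour pixel if it belongs to the class and
    at least one of its 4-neighbours does not (pixels outside the image
    count as not belonging to the class).
    """
    region = mask == class_id
    # Pad with False so regions touching the image border get a contour there.
    padded = np.pad(region, 1, mode="constant", constant_values=False)
    up = padded[:-2, 1:-1]
    down = padded[2:, 1:-1]
    left = padded[1:-1, :-2]
    right = padded[1:-1, 2:]
    # Interior pixels have all four neighbours inside the region.
    interior = up & down & left & right
    return (region & ~interior).astype(np.uint8)


# Toy example: a 3x3 "building" (class 1) inside a 5x5 tile.
toy_mask = np.zeros((5, 5), dtype=np.int64)
toy_mask[1:4, 1:4] = 1
toy_contour = mask_to_contour(toy_mask, class_id=1)
print(toy_contour)
```

In the toy example the eight pixels on the ring of the 3x3 block are marked as contour and the centre pixel is not. Pairing such a contour map with the class name as a text prompt would give one image-text-contour triplet of the kind the dataset is described as containing.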
Abstract
The Direct Segment Anything Model (DirectSAM) excels in class-agnostic contour extraction. In this paper, we apply it to optical remote sensing imagery, where semantic contour extraction, such as identifying buildings, road networks, and coastlines, holds significant practical value. These applications are currently handled by training specialized small models separately on small datasets in each domain. We introduce a foundation model derived from DirectSAM, termed DirectSAM-RS, which not only inherits the strong segmentation capability acquired from natural images, but also benefits from a large-scale dataset we created for remote sensing semantic contour extraction. This dataset comprises over 34k image-text-contour triplets, making it at least 30 times larger than individual existing datasets. DirectSAM-RS integrates a prompter module, consisting of a text encoder and cross-attention layers attached to the DirectSAM architecture, which allows flexible conditioning on target class labels or referring expressions. We evaluate DirectSAM-RS in both zero-shot and fine-tuning settings, and demonstrate that it achieves state-of-the-art performance across several downstream benchmarks.
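The prompter described above conditions visual features on text through cross-attention. The following NumPy sketch shows generic single-head cross-attention with a residual connection, in which visual tokens act as queries and text tokens as keys and values. It is an illustration of the general mechanism only. The function names, the token shapes, the random projection matrices, and the residual form are assumptions, not the DirectSAM-RS architecture.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(visual, text, w_q, w_k, w_v):
    """Single-head cross-attention: visual tokens attend to text tokens.

    visual: (N, d) visual tokens, text: (T, d) text tokens.
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical, random here).
    Returns the text-conditioned visual tokens and the attention weights.
    """
    q = visual @ w_q
    k = text @ w_k
    v = text @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, T)
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return visual + attn @ v, attn           # residual connection


rng = np.random.default_rng(0)
d = 4
visual_tokens = rng.normal(size=(6, d))  # e.g. 6 image patches
text_tokens = rng.normal(size=(3, d))    # e.g. 3 tokens of a class prompt
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

conditioned, weights = cross_attention(visual_tokens, text_tokens, w_q, w_k, w_v)
print(conditioned.shape, weights.shape)
```

Because the output keeps the shape of the visual tokens, a layer of this kind can in principle be attached to an existing vision backbone without changing the downstream layers, which is consistent with the abstract's description of layers "attached to the DirectSAM architecture".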
