Prompting DirectSAM for Semantic Contour Extraction in Remote Sensing Images

Shiyu Miao; Delong Chen; Fan Liu; Chuanyi Zhang; Yanhui Gu; Shengjie Guo; Jun Zhou

TL;DR

The paper addresses the gap in semantic contour extraction for remote sensing by extending DirectSAM with a vision-language prompter and a large-scale dataset, RemoteContour-34k. It introduces DirectSAM-RS, a multi-task, promptable foundation model that operates in zero-shot settings and achieves state-of-the-art results when fine-tuned on road, building, and coastline benchmarks. The approach relies on a Mask2Contour transformation to repurpose existing segmentation datasets into contour annotations, and on a cross-attention prompter that conditions contour prediction on language prompts, enabling flexible, class-aware contour extraction. The findings demonstrate substantial performance gains and show the potential of scalable, multi-task, language-conditioned contour extraction for tasks such as mapping and infrastructure monitoring. The work also suggests expanding the data and exploring few-shot learning to further improve generalization across diverse remote-sensing environments.

Abstract

The Direct Segment Anything Model (DirectSAM) excels in class-agnostic contour extraction. In this paper, we explore its use in optical remote sensing imagery, where semantic contour extraction (such as identifying buildings, road networks, and coastlines) holds significant practical value. These applications are currently handled by training specialized small models separately on small datasets in each domain. We introduce a foundation model derived from DirectSAM, termed DirectSAM-RS, which not only inherits the strong segmentation capability acquired from natural images, but also benefits from a large-scale dataset we created for remote sensing semantic contour extraction. This dataset comprises over 34k image-text-contour triplets, making it at least 30 times larger than any individual dataset. DirectSAM-RS integrates a prompter module, consisting of a text encoder and cross-attention layers attached to the DirectSAM architecture, which allows flexible conditioning on target class labels or referring expressions. We evaluate DirectSAM-RS in both zero-shot and fine-tuning settings, and demonstrate that it achieves state-of-the-art performance across several downstream benchmarks.
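The prompter idea in the abstract (a text encoder plus cross-attention layers that condition visual features on a prompt) can be illustrated with a minimal single-head sketch in pure Python. This is not the paper's implementation: the choice of visual tokens as queries and text tokens as keys and values, the residual addition, and all names (`cross_attention`, `w_q`, `w_k`, `w_v`) are assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual_tokens, text_tokens, w_q, w_k, w_v):
    """Fuse text information into visual tokens (hypothetical single-head layer).

    Assumption: visual tokens act as queries, text tokens as keys/values,
    and the attended text is added back to the visual tokens (residual).
    """
    queries = matmul(visual_tokens, w_q)
    keys = matmul(text_tokens, w_k)
    values = matmul(text_tokens, w_v)
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, values)
    fused = [[v + a for v, a in zip(v_row, a_row)]
             for v_row, a_row in zip(visual_tokens, attended)]
    return fused, weights

# Toy example: 3 visual tokens and 2 prompt tokens, all of dimension 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt = [[2.0, 0.0], [0.0, 2.0]]  # stand-in for text-encoder output
fused, weights = cross_attention(visual, prompt, identity, identity, identity)
```

Each row of `weights` is a distribution over prompt tokens, so every visual location receives a prompt-dependent mixture of text features; in a real model the projections would be learned and the layer applied at several encoder stages.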

Paper Structure (15 sections, 6 figures, 1 table)

Introduction
Method
Preliminary: DirectSAM
Overview of Our Methodology
Dataset Construction
Model Architecture
Experiments
Implementation Details
Evaluation Metrics
Benchmarking DirectSAM-RS
Zero-shot (ZS) setting
Fine-tuning (FT) setting
Ablation of DirectSAM SA-1B Pretraining
Importance of Scaling-up Pretraining Data
Conclusion and Future Works

Figures (6)

Figure 1: Composition of the proposed dataset. We annotate the number of samples and the percentage of each subset and each class. The RemoteContour-34k dataset contains both rich semantics (as shown by the word cloud) and diverse visual domains (e.g., urban, rural).
Figure 2: While able to identify most key elements, the raw DirectSAM model misses the contours of cars and roads and over-segments part components. Moreover, because it is class-agnostic, the model extracts the contours of all semantic targets. These issues limit its direct applicability to semantic contour extraction in remote sensing.
Figure 3: Example of the proposed Mask2Contour (M2C) transformation. It enables us to repurpose existing semantic segmentation datasets with mask annotations for the semantic contour extraction task.
Figure 4: Model architecture of the proposed DirectSAM-RS. We extend the base DirectSAM model (encoder blocks and contour decoder) with a prompter architecture (red). This prompter consists of a text encoder that extracts semantic information from textual prompts, and cross-attention layers that fuse the prompt information into visual feature maps at different stages.
Figure 5: Inference examples of both zero-shot and fine-tuned DirectSAM-RS. Zero-shot DirectSAM-RS (left) demonstrates its ability to flexibly adjust the semantic target according to the given prompt, while fine-tuned DirectSAM-RS (right) produces accurate contours for specific classes.
...and 1 more figure
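The Mask2Contour idea from the Figure 3 caption, turning a segmentation mask into a contour annotation, can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes a pixel belongs to the contour when it has the target class and at least one 4-connected neighbour of a different class, and the function name `mask_to_contour` is hypothetical.

```python
def mask_to_contour(mask, target_class):
    """Return a binary contour map for `target_class` in a 2D label mask.

    Assumption: a contour pixel is a target-class pixel with at least one
    4-connected neighbour of another class. Image borders are not treated
    as class boundaries in this sketch.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    contour = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x] != target_class:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and mask[ny][nx] != target_class:
                    contour[y][x] = 1
                    break
    return contour

# Toy label mask: class 1 (e.g. "building") forms a 3x3 block on background 0.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
contour = mask_to_contour(mask, target_class=1)
# Pair the contour with a text prompt, mirroring the image-text-contour triplets.
sample = {"prompt": "building", "contour": contour}
```

Running the transformation once per class in a mask yields one prompt-contour pair per class, which is how a single segmentation dataset could supply several semantic contour targets.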
