Visual and Text Prompt Segmentation: A Novel Multi-Model Framework for Remote Sensing

Xing Zi, Kairui Jin, Xian Tao, Jun Li, Ali Braytee, Rajiv Ratn Shah, Mukesh Prasad

TL;DR

The paper tackles zero-shot remote sensing segmentation by addressing CLIP's global-focus bias, SAM's mask redundancy, and lack of multi-scale adaptation. It introduces VTPSeg, a training-free multi-model framework that combines Grounding DINO+ for multi-scale detection, CLIP Filter++ for refined, context-aware filtering via visual/text prompts, and FastSAM for precise, point-prompt-guided segmentation. The approach yields consistent MIoU and pixel-accuracy gains across multiple remote sensing datasets (urban, rural, and disaster contexts) and outperforms several zero-shot baselines, demonstrating strong generalization and practical impact. By enabling open-vocabulary, multi-scale segmentation without task-specific re-training, VTPSeg advances scalable remote sensing analysis for environmental monitoring, disaster response, and urban planning.

Abstract

Pixel-level segmentation is essential in remote sensing, where foundational vision models like CLIP and the Segment Anything Model (SAM) have demonstrated significant capabilities in zero-shot segmentation tasks. Despite these advances, challenges specific to remote sensing remain substantial. First, without clear prompt constraints, SAM often generates redundant masks, which makes post-processing more complex. Second, the CLIP model, designed mainly for global feature alignment, often overlooks local objects that are crucial to remote sensing. This oversight leads to inaccurate recognition or misplaced focus in multi-target remote sensing imagery. Third, neither model has been pre-trained on multi-scale aerial views, which increases the likelihood of detection failures. To tackle these challenges, we introduce the VTPSeg pipeline, which combines the strengths of Grounding DINO, CLIP, and SAM for enhanced open-vocabulary image segmentation. The Grounding DINO+ (GD+) module generates initial candidate bounding boxes, while the CLIP Filter++ (CLIP++) module uses a combination of visual and textual prompts to filter out irrelevant bounding boxes, so that only pertinent objects are considered. The refined bounding boxes then serve as specific prompts for the FastSAM model, which performs the final segmentation. We validate VTPSeg through experiments and ablation studies on five popular remote sensing image segmentation datasets.
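
The three-stage data flow described in the abstract (candidate boxes, filtering, box-prompted segmentation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Detection`, `run_pipeline`, and the `detect`, `keep`, and `segment` callables are hypothetical names standing in for the GD+, CLIP++, and FastSAM stages, whose actual interfaces are not given here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    """Hypothetical container for one candidate box."""
    box: Box
    score: float  # detector confidence
    label: str    # text query that produced the box


def run_pipeline(
    image: object,
    queries: Sequence[str],
    detect: Callable[[object, Sequence[str]], List[Detection]],
    keep: Callable[[object, Detection], bool],
    segment: Callable[[object, Box], object],
) -> List[Tuple[Detection, object]]:
    """Detect candidate boxes, filter them, then segment each survivor.

    The three callables are placeholders (assumptions) for the detection,
    filtering, and segmentation models; any implementation with these
    signatures can be plugged in.
    """
    candidates = detect(image, queries)                   # stage 1: candidate boxes
    refined = [d for d in candidates if keep(image, d)]   # stage 2: drop irrelevant boxes
    return [(d, segment(image, d.box)) for d in refined]  # stage 3: one mask per kept box
```

Passing the stages in as callables keeps the sketch independent of any particular model library; the filtering stage only needs to return a keep-or-drop decision per box.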

Paper Structure

This paper contains 18 sections, 7 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Remote sensing target recognition with three different zero-shot approaches. (a) Multi-Modal Based Classification. (b) SAM Based Text Prompts Segmentation. (c) Our Proposed Method - VTPSeg
  • Figure 2: (a) The overall pipeline of VTPSeg. Given a remote sensing image, the user provides a set of text queries for the targets to be segmented; here, the text of interest is "building". Supplying multiple similar descriptions of the same target, e.g., "a roof of building", helps avoid missed detections. The text prompts and multi-scale patches of the image are fed into Grounding DINO for detection, and redundant boxes are then suppressed by NMS. (b) The CLIP Filter++ stage, which includes a Visual Prompts module and an Attention Classifier module; the latter evaluates the image-text alignment of each visual prompt.
  • Figure 3: Comparison of visualization results from FastSAM (Everything and text prompts), MobileSAM, Grounded-SAM, and VTPSeg across four datasets. VTPSeg consistently delivers significantly better results.
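
The Figure 2 caption mentions that boxes detected on multi-scale patches are suppressed with NMS. As a rough illustration of that step, here is a minimal sketch of standard greedy non-maximum suppression in plain Python. The function names and the IoU threshold of 0.5 are assumptions for the example; the value used in the paper is not stated here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first.

    iou_threshold=0.5 is an illustrative default, not a value from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

With multi-scale patches, the same object is often detected several times at different scales, so box coordinates would first need to be mapped back to the full image before a pass like this merges the duplicates.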