UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang

TL;DR

GeoSeg-1M and GeoSeg-Bench address the lack of large-scale open-world instruction-driven segmentation data for geospatial scenes by unifying referring, interactive, and reasoning tasks. The authors introduce UniGeoSeg, a unified baseline with task-aware text enhancement (TATE), latent knowledge memory (LKM), and progressive task scheduling (PTS) to enable multi-task learning and robust generalization. Experiments show state-of-the-art performance on GeoSeg-Bench and strong zero-shot and cross-dataset generalization across remote sensing (RS) benchmarks. The resources provide a scalable foundation for instruction-driven geospatial understanding and open-world RS intelligence.

Abstract

Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.

Paper Structure

This paper contains 43 sections, 9 equations, 15 figures, 15 tables.

Figures (15)

  • Figure 1: Examples from GeoSeg-1M. (a) Referring segmentation; (b) Interactive segmentation; (c) Attribute-oriented reasoning segmentation; (d) Context-oriented reasoning segmentation.
  • Figure 2: The diagram of UniGeoSeg. The top indicates the whole pipeline, and the bottom describes each module.
  • Figure 3: Qualitative examples of the segmentations generated by UniGeoSeg and comparative methods on GeoSeg-Bench.
  • Figure 4: The three mask marking strategies we tried in model-based mask filtering. (a) Boundary-only highlight. (b) Semi-transparent filled-mask overlay. (c) The original image and the binary mask.
  • Figure 5: The prompt of InternVL3 for mask filtering. The images marked with $\checkmark$ in the top-right corner are retained, as their mask regions exhibit clear semantic structure, while the images marked with $\times$ in the bottom-right corner are filtered out due to ambiguous semantics.
  • ...and 10 more figures