ConInfer: Context-Aware Inference for Training-Free Open-Vocabulary Remote Sensing Segmentation

Wenyang Chen, Zhanxuan Hu, Yaping Zhang, Hailong Ning, Yonghang Tai

Abstract

Training-free open-vocabulary remote sensing segmentation (OVRSS), empowered by vision-language models, has emerged as a promising paradigm for category-agnostic semantic understanding of remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch-level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data: real-world remote sensing scenes are typically large-scale and exhibit strong spatial and semantic correlations, so isolated patch-wise predictions are insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context-aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter-unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state-of-the-art per-pixel VLM-based baselines such as SegEarth-OV, achieving average improvements of 2.80% and 6.13% on open-vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/Dog-Yang/ConInfer
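
The abstract (and Figure 2 below) describe ConInfer as joint, training-free inference that combines semantic priors from CLIP with structural regularization from DINO. The following is a minimal, hypothetical PyTorch sketch of one way such context-aware joint inference could look: per-patch CLIP patch-text logits are smoothed over a DINO feature-affinity graph by label propagation. The function name, tensor shapes, hyperparameters (alpha, num_iters, topk), and the propagation rule are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch: context-aware joint inference over patches.
# CLIP supplies per-patch class evidence; DINO features define a patch
# affinity graph used to propagate that evidence across the scene.
import torch
import torch.nn.functional as F

def context_aware_inference(clip_logits: torch.Tensor,
                            dino_feats: torch.Tensor,
                            alpha: float = 0.9,
                            num_iters: int = 20,
                            topk: int = 16) -> torch.Tensor:
    """clip_logits: (N, C) patch-to-text similarities from CLIP.
    dino_feats:  (N, D) patch features from DINO/DINOv3.
    Returns per-patch class labels of shape (N,)."""
    # Cosine affinity between patches; keep only each patch's top-k
    # neighbours so the graph stays sparse and local.
    f = F.normalize(dino_feats, dim=-1)
    sim = f @ f.t()                                        # (N, N)
    vals, idx = sim.topk(topk, dim=-1)
    W = torch.zeros_like(sim).scatter_(-1, idx, vals.clamp(min=0))
    W = W / W.sum(dim=-1, keepdim=True).clamp(min=1e-6)    # row-stochastic

    # Label propagation: repeatedly blend each patch's own CLIP evidence
    # with that of its DINO neighbours until the scores stabilize.
    prior = clip_logits.softmax(dim=-1)
    Z = prior
    for _ in range(num_iters):
        Z = alpha * (W @ Z) + (1.0 - alpha) * prior
    return Z.argmax(dim=-1)

# Toy usage: 64 patches, 5 candidate classes, 384-d DINO features.
labels = context_aware_inference(torch.randn(64, 5), torch.randn(64, 384))
```

Because the row-normalized update is a convex averaging step, the propagated scores converge quickly and monotonically in practice, which is at least consistent with the rapid, stable convergence the paper reports in Figure 5.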

Paper Structure

This paper contains 30 sections, 13 equations, 8 figures, 9 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Compared with the direct predictions of MaskCLIP, the segmentation maps generated by clustering patch features from CLIP or DINOv3 exhibit superior spatial consistency.
  • Figure 2: Overview of the proposed Context-Aware Inference (ConInfer) framework. ConInfer performs joint, training-free inference by integrating semantic priors from CLIP with structural regularization from DINO. This formulation enables globally consistent open-vocabulary segmentation without any retraining.
  • Figure 3: Examples from the UDD5 dataset illustrating cases where ConInfer encounters difficulties, including weak boundaries on small objects (vehicles) and category confusion (roof tiles).
  • Figure 4: Qualitative comparison of different OVSS methods on the OpenEarthMap (land cover mapping), VDD (UAV aerial scene), and WBI-SI (water body segmentation) datasets. GT denotes the ground truth.
  • Figure 5: Convergence analysis on 8 datasets. During the iterative process, the loss converges rapidly and then remains stable.
  • ...and 3 more figures