ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery

Ke Li, Ting Wang, Di Wang, Yongshan Zhu, Yiming Zhang, Tao Lei, Quan Wang

Abstract

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles at different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a survey-locate-verify scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to handle the large variations in object scale typical of remote sensing imagery, along with a language-guided calibration decoder that refines cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, i.e., RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
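
To make the survey-locate-verify idea concrete, here is a minimal PyTorch sketch of a progressive cross-modal modulator: each stage cross-attends the visual tokens to one decoupled language cue (context, then spatial, then attribute) and reweights the visual features before the next stage. All module names, dimensions, and the specific reweighting form are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class ProgressiveModulator(nn.Module):
    """Sketch of a survey-locate-verify style modulator (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        # One query/key projection pair per cue; a real design may use
        # multi-head attention or gated feature injection instead.
        self.q_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))
        self.k_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))

    def forward(self, vis, cues):
        # vis:  (B, N, C) flattened visual tokens
        # cues: three (B, T_i, C) language features
        #       in order [context, spatial, attribute]
        for q_proj, k_proj, cue in zip(self.q_proj, self.k_proj, cues):
            q = q_proj(vis)                                   # (B, N, C)
            k = k_proj(cue)                                   # (B, T, C)
            attn = torch.einsum("bnc,btc->bnt", q, k) / q.shape[-1] ** 0.5
            # Collapse over language tokens to one relevance score per
            # visual location, then reweight the visual tokens with it.
            score = attn.softmax(dim=-1).max(dim=-1).values   # (B, N)
            vis = vis * (1.0 + score.unsqueeze(-1))
        return vis


# Usage with assumed shapes: a 32x32 feature map flattened to 1024 tokens
# and three 8-token cue embeddings.
modulator = ProgressiveModulator(dim=256)
vis = torch.randn(2, 1024, 256)
cues = [torch.randn(2, 8, 256) for _ in range(3)]
out = modulator(vis, cues)  # (2, 1024, 256)
```

Reweighting rather than replacing the visual features keeps each stage's modulation residual, so later cues refine rather than overwrite earlier ones; the paper's actual modulator may realize this differently.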

Figures (5)

  • Figure 1: Left: Human cognition resolves referring expressions in a staged manner, involving global context understanding, spatial localization, and attribute verification. Right: Inspired by this process, we propose a progressive grounding paradigm for RSVG, which decouples linguistic cues into global context, spatial relations, and object attributes, and integrates them via a survey-locate-verify scheme to guide visual attention.
  • Figure 2: Architectural comparison of different cross-modal modulators. (a) Global context guidance. (b) Parallel injection of decoupled spatial and attribute cues. (c) Sequential injection of decoupled spatial and attribute cues. (d) The proposed progressive cross-modal modulator based on survey-locate-verify attention. Here, $L^c$, $L^s$, and $L^a$ denote the context, spatial, and attribute features, respectively, and VB-Stage-X represents the X-th stage of the visual backbone.
  • Figure 3: Visualization of the attention maps of the proposed progressive cross-modal modulator.
  • Figure 4: Overall framework of the proposed ProVG. It consists of three components: (a) a visual-text feature extractor with a progressive cross-modal modulator, (b) a cross-scale fusion module, and (c) a language-guided calibration decoder with a unified multi-task prediction head (a minimal wiring sketch follows this list).
  • Figure 5: Qualitative comparisons between ProVG and previous SOTA methods.
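
As a rough illustration of the data flow in Figure 4, the sketch below wires modulated multi-scale backbone features through a cross-scale fusion stand-in, a language-guided decoder, and a unified box/mask head. The submodule internals (a 1x1 convolution for fusion, a single transformer decoder layer for calibration) are placeholders assumed for brevity; only the overall wiring mirrors the figure.

```python
import torch
import torch.nn as nn


class ProVGSketch(nn.Module):
    """Assumed wiring of the three components in Figure 4 (not the paper's code)."""

    def __init__(self, dim=256, num_queries=1):
        super().__init__()
        self.fuse = nn.Conv2d(dim * 3, dim, kernel_size=1)  # cross-scale fusion stand-in
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.query = nn.Embedding(num_queries, dim)
        self.box_head = nn.Linear(dim, 4)                   # (cx, cy, w, h) for REC
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)   # mask logits for RES

    def forward(self, feats, lang):
        # feats: three (B, C, H_i, W_i) modulated backbone feature maps
        # lang:  (B, T, C) language features
        h, w = feats[0].shape[-2:]
        up = [nn.functional.interpolate(f, size=(h, w), mode="bilinear",
                                        align_corners=False) for f in feats]
        fused = self.fuse(torch.cat(up, dim=1))             # (B, C, H, W)

        tokens = fused.flatten(2).transpose(1, 2)           # (B, HW, C)
        # Language-guided calibration: refine object queries against both
        # visual and language tokens (concatenated as memory for brevity).
        memory = torch.cat([tokens, lang], dim=1)
        q = self.query.weight.unsqueeze(0).expand(fused.size(0), -1, -1)
        q = self.decoder(q, memory)                         # (B, Q, C)

        # Unified multi-task head: one normalized box per query plus a
        # mask logit map at the fused resolution.
        box = self.box_head(q).sigmoid()                    # (B, Q, 4)
        mask = self.mask_head(fused)                        # (B, 1, H, W)
        return box, mask
```

For example, feeding three feature maps of shapes (2, 256, 32, 32), (2, 256, 16, 16), and (2, 256, 8, 8) together with a (2, T, 256) language tensor yields one normalized box per query and a 32x32 mask logit map, covering both the comprehension and segmentation tasks from a single head.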