RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

Ke Li, Di Wang, Ting Wang, Fuyu Dong, Yiming Zhang, Luyao Zhang, Xiangyu Wang, Shaofeng Li, Quan Wang

TL;DR

This work tackles zero-shot open-vocabulary visual grounding in remote sensing by proposing RSVG-ZeroOV, a training-free framework that fuses frozen vision-language and diffusion models. The method follows an overview-focus-evolve pipeline: the overview step extracts a cross-attention map A_C from a VLM, the focus step injects diffusion-model self-attention priors A_S to form A_CS, and the evolve step expands regions from high-activation seeds to produce a clean mask, with an optional SAM-based refinement. Experiments on RRSIS-D and RISBench show state-of-the-art zero-shot performance, outperforming weakly-supervised and other zero-shot baselines and demonstrating strong open-vocabulary grounding without task-specific training. The approach offers a scalable, data-efficient path for RS perception: a lightweight, training-free pipeline that draws on the structure and semantics encoded in foundation models and adapts to diverse user prompts and object concepts without task-specific supervision.

Abstract

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that explores the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions. (In this paper, although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.) (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by the VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.
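To make the focus and evolve stages more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not give the paper's equations, so the fusion rule (averaging the cross-attention map through row-normalized self-attention weights), the two thresholds `seed_thr` and `grow_thr`, the 4-neighbor region growing, and all function names are assumptions chosen for illustration. The attention maps are taken as already extracted from the frozen VLM and DM and flattened row by row.

```python
from collections import deque


def focus(a_c, a_s):
    """Assumed fusion: propagate cross-attention a_c (length N) through
    self-attention a_s (N x N), normalizing each row of a_s to sum to 1."""
    fused = []
    for row in a_s:
        total = sum(row) or 1.0  # guard against an all-zero row
        fused.append(sum(w * a for w, a in zip(row, a_c)) / total)
    return fused


def normalize(values):
    """Min-max scale a flat list of scores to [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]


def evolve(a_cs, height, width, seed_thr=0.8, grow_thr=0.4):
    """Assumed evolution rule: start from high-activation seeds and grow
    into 4-connected neighbors whose score is at least grow_thr."""
    scores = normalize(a_cs)
    mask = [[False] * width for _ in range(height)]
    queue = deque()
    # Seeds: pixels whose normalized activation reaches seed_thr.
    for idx, score in enumerate(scores):
        if score >= seed_thr:
            r, c = divmod(idx, width)
            mask[r][c] = True
            queue.append((r, c))
    # Breadth-first growth; activations not connected to a seed stay off.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width and not mask[nr][nc]:
                if scores[nr * width + nc] >= grow_thr:
                    mask[nr][nc] = True
                    queue.append((nr, nc))
    return mask


if __name__ == "__main__":
    # Toy 1 x 4 map with an identity self-attention, so A_CS equals A_C.
    a_c = [1.0, 0.5, 0.1, 0.6]
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print(evolve(focus(a_c, identity), height=1, width=4))
```

In the toy example, the last pixel has a moderately high activation but is separated from the seed by a low-activation pixel, so region growing leaves it out of the mask. This is the sense in which a seed-based expansion can suppress irrelevant activations that a single global threshold would keep.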

Paper Structure

This paper contains 10 sections, 7 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Illustration of different open-vocabulary remote sensing tasks. The left and middle panels depict category-driven tasks, where models rely on predefined category names to distinguish different image regions. The right panel represents the intention-driven paradigm, which lets users flexibly specify objects using natural language expressions.
  • Figure 2: (a) Visualization of receptive fields derived from self-attention maps in the DM. (b) Comparison of attention embedding results using self-attention maps from different models.
  • Figure 3: The framework of the proposed RSVG-ZeroOV.
  • Figure 4: Visualization of selected cross-attention maps from the VLM on the RRSIS-D test set.