SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing

Aysim Toker, Andreea-Maria Oncescu, Roy Miles, Ismail Elezi, Jiankang Deng

TL;DR

SATGround introduces a spatially-aware grounding mechanism for vision-language models in remote sensing by pairing a frozen visual encoder with a finetuned language model and a dedicated grounding module. It uses two special tokens, <bb> and <loc>, to bridge language generation and bounding-box regression via a lightweight grounding head and Hungarian matching, enabling explicit geometric reasoning. The approach, trained on GeoChat and EarthDial, achieves state-of-the-art results across grounding, grounding description, and VQA benchmarks, including a 24.8% relative improvement in visual grounding. This structured, dual-space grounding enhances localization robustness in complex satellite scenes and paves the way for more reliable real-world Earth observation analysis.

Abstract

Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
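
As a reading aid: a relative improvement is conventionally measured against the previous best score rather than in absolute points. Assuming that convention applies here (the symbols $s_\text{ours}$ and $s_\text{prev}$ are placeholders for this note, not notation from the paper), the 24.8% figure would correspond to:

$$
\frac{s_\text{ours} - s_\text{prev}}{s_\text{prev}} = 0.248
\quad\Longleftrightarrow\quad
s_\text{ours} = 1.248\, s_\text{prev}
$$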

Paper Structure

This paper contains 33 sections, 7 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: We propose a novel spatially-aware grounding mechanism for vision-language models (VLMs) in remote sensing. Our approach (green) integrates structured spatial information, leading to significantly improved localization capabilities compared to methods that treat bounding boxes as text. The qualitative examples shown here highlight our model's superior alignment with the ground truth (red) on diverse user prompts, outperforming strong baselines like InternVL [chen2024internvl] (CVPR'24, orange) and EarthDial [Soni_2025_CVPR] (CVPR'25, yellow).
  • Figure 2: Structured visual grounding. For a given user query (left), we compare the conventional text-based visual grounding approach (middle) with our structured explicit grounding format (right). For illustration purposes, we show the ground-truth bounding box location values (red and orange) in both cases. Instead of returning bounding box coordinates as text, we model a dedicated localization mechanism interfaced by special control tokens $\langle bb \rangle$ and $\langle loc \rangle$; see \ref{subsec:linkingregression} for more details.
  • Figure 3: Method overview. We summarize the different components of our model. Visual inputs are encoded using a frozen vision backbone, $\phi_\text{visual}$, and a trainable adapter, $\phi_\text{projector}$. The extracted vision tokens are then concatenated with the user query and passed to the language model $\phi_\text{lm}$, based on LLaVA [liu2024visual], which we finetune by applying LoRA [hu2022lora]. The model's response contains both standard text tokens and occasional $\langle bb \rangle$ and $\langle loc \rangle$ tokens; see \ref{subsec:linkingregression} for more details. Any resulting $\langle loc \rangle$ feature embeddings are passed to the grounding module $\phi_\text{grounding}$, which produces the final bounding box predictions and matches them using the Hungarian algorithm.
  • Figure 4: Qualitative comparison for visual grounding. These instances correspond to the quantitative results in \ref{tab:groundingperformance}. For each sample, we show predictions by our approach in green, InternVL [chen2024internvl] in orange, and EarthDial [Soni_2025_CVPR] in yellow. The ground-truth bounding boxes are shown in red.
  • Figure 5: Greedy sampling
  • ...and 4 more figures
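
The Figure 3 caption mentions that predicted boxes are matched with the Hungarian algorithm. The following is a minimal sketch of that kind of one-to-one matching, not the paper's implementation. Several details are assumptions made for illustration: boxes are `(x1, y1, x2, y2)` tuples normalized to [0, 1], predictions are matched against ground-truth boxes, the cost is a plain L1 distance, and the names `l1_cost` and `match_boxes` are hypothetical. To stay dependency-free, the sketch replaces the Hungarian algorithm with an exhaustive search over assignments, which finds the same minimum-cost matching but only scales to a handful of boxes.

```python
from itertools import permutations

# Assumed box format: (x1, y1, x2, y2), normalized to [0, 1].
Box = tuple[float, float, float, float]


def l1_cost(pred: Box, target: Box) -> float:
    """Sum of absolute coordinate differences (an assumed matching cost)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_boxes(preds: list[Box], targets: list[Box]) -> list[tuple[int, int]]:
    """Return (pred_index, target_index) pairs with minimal total L1 cost.

    Exhaustive search stands in for the Hungarian algorithm here: both give
    a minimum-cost one-to-one assignment, but this version is factorial in
    the number of boxes. Returns [] if there are fewer predictions than
    targets or if there are no targets.
    """
    best_cost = float("inf")
    best_assignment: tuple[int, ...] = ()
    # Each permutation assigns a distinct prediction to every target.
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return sorted((p, t) for t, p in enumerate(best_assignment))


# Two predictions listed in the opposite order to their ground-truth boxes.
preds = [(0.60, 0.60, 0.90, 0.90), (0.10, 0.10, 0.30, 0.30)]
targets = [(0.12, 0.10, 0.28, 0.31), (0.58, 0.62, 0.91, 0.88)]
matches = match_boxes(preds, targets)
print(matches)  # [(0, 1), (1, 0)]
```

Because the matching is order-independent, the model is not penalized for emitting boxes in a different order than the annotations. For larger box sets, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` applied to the pairwise cost matrix would be the usual choice.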