Asking like Socrates: Socrates helps VLMs understand remote sensing images

Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li

TL;DR

RS-EoT addresses the Glance Effect in remote sensing reasoning by wrapping language-driven reasoning around an iterative, evidence-seeking perception loop. It introduces a SocraticAgent to synthesize RS-EoT traces during SFT and then employs a two-stage progressive RL pipeline (Grounding then VQA) with a novel multiple-choice VQA reconstruction to stabilize training. Empirical results show state-of-the-art performance on RS VQA and grounding benchmarks, supported by analyses of attention dynamics and case studies that confirm iterative reasoning and evidence gathering. The work advances trustworthy geospatial AI by enabling genuine, grounded reasoning over large-scale RS imagery.

Abstract

Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates
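The first RL stage trains on grounding, and the Figure 2 caption describes its reward as IoU-based. The following is a minimal sketch of such a reward, assuming axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form and assuming the raw IoU is used directly as the reward; the exact reward shaping is not specified in this summary, and the names `box_iou` and `grounding_reward` are hypothetical.

```python
from typing import Sequence


def box_iou(pred: Sequence[float], target: Sequence[float]) -> float:
    """Intersection-over-union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max). Returns a value in [0, 1].
    """
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(pred[0], target[0])
    iy_min = max(pred[1], target[1])
    ix_max = min(pred[2], target[2])
    iy_max = min(pred[3], target[3])

    # Clamp at zero so disjoint boxes yield zero intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_target = max(0.0, target[2] - target[0]) * max(0.0, target[3] - target[1])
    union = area_pred + area_target - inter

    # Degenerate boxes with zero total area get IoU 0 by convention.
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: Sequence[float], target: Sequence[float]) -> float:
    """Hypothetical reward: assumes the raw IoU is the reward signal."""
    return box_iou(pred, target)
```

A continuous reward of this kind gives partial credit for near-miss boxes, which is one plausible reason to prefer it over a binary hit-or-miss signal.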

Paper Structure

This paper contains 35 sections, 6 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Illustration of the pseudo reasoning problem and our RS-EoT solution. (a) Existing models show pseudo reasoning: explicit thinking (blue bars) degrades performance below the non-reasoning base model (red bar). (b) We attribute this to the "Glance Effect"—reasoning based on a single, coarse perception. We propose RS-EoT, an iterative evidence-seeking loop. (c) Our model, RS-EoT-7B, successfully solves the task by iteratively reasoning and seeking visual evidence.
  • Figure 2: Overview of our method to instill the RS-EoT paradigm. (Left) SFT: RS-EoT Cold-Start: We propose SocraticAgent to synthesize reasoning traces. A Reasoner (text-only) and a Perceiver (image-aware) engage in an iterative dialogue, guided by a self-play prompting mechanism. (Right) RL: Enhancing and Generalizing RS-EoT: A two-stage progressive RL pipeline. Stage 1 (RL-Grounding) enhances fine-grained evidence-seeking via an IoU-based reward. Building on this, Stage 2 (RL-VQA) generalizes reasoning by converting simple VQA datasets into a multiple-choice format with a graded reward for stable training.
  • Figure 3: Case studies comparing RS-EoT-7B with prior multimodal reasoning models on (top) Remote Sensing General QA and (bottom) Fine-grained Grounding. Unlike previous models, RS-EoT-7B follows the RS-EoT paradigm: it iteratively self-questions, gathers additional visual evidence during reasoning, and uses that evidence to verify or adjust its conclusions.
  • Figure 4: Token-wise attention visualization on eight randomly sampled cases. The y-axis represents the proportion of attention allocated to image tokens, and the x-axis represents the token index during the decoding step. Clear periodic patterns emerge: attention peaks on visual tokens (evidence-seeking phases) and then drops during language-based reasoning (reasoning phases). This alternating cycle reflects the iterative reasoning mechanism instilled by the RS-EoT paradigm.
  • Figure 5: The reward curve for the VQA RL stage. The stable upward trend validates that our multiple-choice data reconstruction strategy provides an effective learning signal and successfully mitigates reward hacking.
  • ...and 10 more figures
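The quantity plotted in Figure 4, the proportion of attention allocated to image tokens at each decoding step, can be sketched as follows. This is an illustrative computation only, assuming one row of non-negative attention weights per generated token (already aggregated over heads and layers in some way not specified here) and a boolean mask marking which context positions are image tokens. The function name `image_attention_fraction` and its inputs are hypothetical.

```python
from typing import List, Sequence


def image_attention_fraction(
    attn_rows: Sequence[Sequence[float]],
    is_image_token: Sequence[bool],
) -> List[float]:
    """Fraction of attention mass on image tokens, per decoding step.

    attn_rows[t] holds the attention weights of generated token t over the
    context positions visible to it. Rows may be shorter than the mask,
    since earlier steps see a shorter context.
    """
    fractions = []
    for row in attn_rows:
        if len(row) > len(is_image_token):
            raise ValueError("attention row is longer than the token mask")
        total = sum(row)
        # zip stops at the end of the shorter sequence, i.e. the row.
        on_image = sum(w for w, is_img in zip(row, is_image_token) if is_img)
        # Rows with no attention mass are reported as 0.0.
        fractions.append(on_image / total if total > 0 else 0.0)
    return fractions
```

Plotting the returned list against the token index would give a curve of the kind the caption describes, where peaks correspond to evidence-seeking phases and troughs to language-based reasoning.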