GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang

TL;DR

This work introduces GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning, and proposes Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps.

Abstract

While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
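The test-time scaling idea in the abstract can be illustrated with a toy comparison between majority voting and verifier-guided Best-of-N selection. This is a minimal sketch, not the authors' implementation: the candidate answers and per-token scores are made-up values, the function names are hypothetical, and collapsing token-level scores with `min` is an assumption (the abstract does not say how GeoPRM scores are aggregated).

```python
from collections import Counter

def aggregate_token_scores(token_scores):
    """Collapse per-token faithfulness scores into one trajectory score.

    Assumption: the weakest token decides (min); a mean or product
    would be an equally plausible choice.
    """
    return min(token_scores)

def best_of_n(candidates):
    """Return the answer whose reasoning trace the verifier scores highest.

    candidates: list of (answer, token_scores) pairs.
    """
    best = max(candidates, key=lambda c: aggregate_token_scores(c[1]))
    return best[0]

def self_consistency(candidates):
    """Return the most frequent final answer, ignoring verifier scores."""
    votes = Counter(answer for answer, _ in candidates)
    return votes.most_common(1)[0][0]

# Hypothetical samples: two poorly grounded traces agree on "harbor",
# one well grounded trace says "airport".
candidates = [
    ("harbor", [0.9, 0.4, 0.8]),
    ("harbor", [0.7, 0.3, 0.6]),
    ("airport", [0.9, 0.8, 0.85]),
]

print(best_of_n(candidates))         # verifier-guided choice
print(self_consistency(candidates))  # majority-vote choice
```

In this toy case the two strategies disagree: majority voting follows the more frequent answer, while the verifier-guided choice follows the trace with no low-scoring token. That is the kind of difference a process reward model is meant to exploit.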

Paper Structure

This paper contains 27 sections, 8 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Overall framework of our proposed GeoSolver. To enable reliable geospatial reasoning, we first construct the Geo-PRM-2M dataset via (a) Entropy-Guided MCTS and (b) Synthetic Hallucination Injection. The trained GeoPRM is then seamlessly integrated to (c) scale test-time reasoning via advanced search strategies during inference, and (d) align the base policy via Process-Aware Tree-GRPO during training.
  • Figure 2: Illustration of the Process-Aware Tree-GRPO. The reasoning tree dynamically expands by identifying high-entropy tokens, followed by trajectory rollouts. Leaf nodes receive outcome rewards which are further refined by GeoPRM's drop-moment penalty. These process-aware signals are then aggregated upwards to compute Local Advantage ($LA$) and Global Advantage ($GA$) for fine-grained policy optimization.
  • Figure 3: Comparison of different verification strategies on GeoSolver. We evaluate greedy decoding (GeoSolver w/o TTS), Self-Consistency (majority voting), and our GeoPRM utilizing both Best-of-N and Beam Search strategies. The generation budget is set to 32.
  • Figure 4: Cross-model reward model comparison using BoN performance ($N=32$). We evaluate the verification efficacy of various strategies, including Self-Consistency, generic VLMs acting as verifiers, open-source PRMs, and our domain-specific GeoPRM across diverse remote sensing tasks.
  • Figure 5: Compute-optimal scaling behavior of GeoSolver equipped with GeoPRM. The curves illustrate the performance gains on (a) VG and (b) VQA as the compute budget ($N$) increases.
  • ...and 4 more figures
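The caption of Figure 2 says the reasoning tree expands at high-entropy tokens. A minimal sketch of that selection step follows, assuming the policy exposes a next-token probability distribution at each position. The function names, the toy distributions, and the top-`k` rule are illustrative assumptions; the paper's actual branching criterion is not given in this summary.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def branch_points(step_distributions, k=2):
    """Return the k positions with the highest entropy, in sequence order.

    Assumption: branching at a fixed top-k; a threshold on entropy
    would be another reasonable rule.
    """
    entropies = [token_entropy(d) for d in step_distributions]
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical distributions over a 4-token vocabulary at 4 positions.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform, maximal entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],  # two competing tokens
]

print(branch_points(dists, k=2))  # → [1, 3]
```

Positions where the model is nearly certain are skipped, so the exploration budget goes to the tokens where alternative continuations are most likely to diverge.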