
Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding

Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yuhao Zhou, Di Wang, Yifan Zhang, Haoyu Wang, Haiyan Zhao, Hongda Sun, Long Lan, Jun Song, Yulin Wang, Jing Zhang, Wenlong Zhang, Bo Du

TL;DR

This paper proposes a staged knowledge injection recipe: cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures, and pre-warming on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL.

Abstract

Multimodal reasoning for ultra-high-resolution (UHR) remote sensing (RS) is usually bottlenecked by visual evidence acquisition: the model must localize tiny task-relevant regions in massive pixel spaces. While Agentic Reinforcement Learning with Verifiable Rewards (RLVR) using zoom-in tools offers a path forward, we find that standard reinforcement learning struggles to navigate these vast visual spaces without structured domain priors. In this paper, we investigate the interplay between post-training paradigms by comparing Cold-start Supervised Fine-Tuning (SFT), RLVR, and Agentic RLVR on the UHR RS benchmark. Our controlled studies yield a counter-intuitive finding: high-quality Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Despite lacking images, domain-specific text injects the concepts, mechanistic explanations, and decision rules necessary to guide visual evidence retrieval. Based on this, we propose a staged knowledge injection recipe: (1) cold-starting with scalable, knowledge-graph-verified Earth-science text QA to instill reasoning structures; and (2) "pre-warming" on the same hard UHR image-text examples during SFT to stabilize and amplify subsequent tool-based RL. This approach achieves 60.40% Pass@1 on XLRS-Bench, significantly outperforming larger general-purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) and establishing a new state of the art.
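To make the "zoom-in tool" idea concrete, the following is a minimal sketch of how a requested region might be turned into a pixel crop window on a very large image. The abstract does not describe the tool's interface, so everything here is an assumption for illustration: the function name `zoom_window`, the normalized `(x0, y0, x1, y1)` box format, and the clamping behavior are hypothetical and not the authors' implementation.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]


def _clamp01(value: float) -> float:
    """Clamp a normalized coordinate into the range [0, 1]."""
    return min(max(value, 0.0), 1.0)


def zoom_window(image_size: Tuple[int, int], bbox_norm: Box) -> Tuple[int, int, int, int]:
    """Map a normalized box to a pixel window (left, top, right, bottom).

    Hypothetical helper: the box format and the clamping policy are
    assumptions, not taken from the paper.
    """
    width, height = image_size
    if width <= 0 or height <= 0:
        raise ValueError("image_size must be positive")

    # Clamp to the image and order the corners so x0 <= x1 and y0 <= y1.
    x0, x1 = sorted((_clamp01(bbox_norm[0]), _clamp01(bbox_norm[2])))
    y0, y1 = sorted((_clamp01(bbox_norm[1]), _clamp01(bbox_norm[3])))

    # Round outward so the requested region is fully covered.
    left = min(math.floor(x0 * width), width - 1)
    top = min(math.floor(y0 * height), height - 1)
    right = max(math.ceil(x1 * width), left + 1)
    bottom = max(math.ceil(y1 * height), top + 1)
    return left, top, right, bottom


if __name__ == "__main__":
    # A 10000 x 8000 scene; the agent asks for a region given in
    # normalized coordinates.
    print(zoom_window((10000, 8000), (0.25, 0.5, 0.5, 0.75)))
```

Under these assumptions, an agent would issue such a call one or more times per question and answer from the returned crops. The window is clamped so that out-of-range or degenerate requests still yield a valid, non-empty crop.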

Paper Structure (21 sections, 5 equations, 6 figures, 6 tables)


Figures (6)

  • Figure 1: We investigate the interplay between post-training paradigms and find that Earth-science text-only QA is a primary driver of UHR visual reasoning gains. Finally, Agentic RLVR, trained on our datasets with our method, significantly outperforms existing MLLMs on UHR RS tasks.
  • Figure 2: The impact of domain data on performance across different training methods.
  • Figure 3: The impact of cold-start SFT and RL-stage domain knowledge injection on Pass@1 and Pass@32 in Agentic RLVR. We only use the VQA data (SuperRs-VQA) as the Domain Data. The findings of ① ② ③ are detailed in the main text. Cold-start SFT with domain data yields consistent improvements in both average performance (Pass@1) and reasoning boundary (Pass@32), whereas incorporating domain knowledge only during RL produces smaller or less stable gains, highlighting the importance of staged image-text knowledge incorporation.
  • Figure 4: Effects of domain knowledge modality and injection stage on Agentic RLVR performance.
  • Figure 5: Automated pipeline for Earth-science text QA generation. Panel A shows textbook-based construction that produces candidate exercise-style QA grounded in foundational concepts through corpus collection, cleaning, normalization, and reasoning refinement. Panel B shows paper-based construction that produces candidate literature-style QA targeting frontier topics and complex reasoning through paper parsing, task categorization, and template-guided generation with multi-stage checks. Panel C builds a textbook-derived knowledge graph from the cleaned textbook corpus and uses it to screen and validate candidates from Panels A and B for domain relevance as well as factual and logical consistency.
  • ...and 1 more figures
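
The Figure 3 caption contrasts Pass@1 (average performance) with Pass@32 (reasoning boundary). This page does not state how these metrics are computed. The sketch below therefore assumes the commonly used unbiased pass@k estimator, in which n answers are sampled per question and c of them are correct. The function names and the example counts are illustrative only and are not taken from the paper.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: number of sampled answers for one question
    c: number of those answers that are correct
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # contains at least one correct answer.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over questions given (n, c) pairs per question."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative counts only: three questions, 32 samples each.
    demo = [(32, 8, ), (32, 0), (32, 1)]
    print("Pass@1 :", mean_pass_at_k(demo, 1))
    print("Pass@32:", mean_pass_at_k(demo, 32))
```

With k = 1 the estimator reduces to the per-question fraction of correct samples, which is why Pass@1 reads as average performance. With k = n it is 1 whenever any sample is correct, which is why a large-k score such as Pass@32 is read as the boundary of what the model can solve at all.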