Remote Sensing-Oriented World Model
Yuxi Lu, Biao Wu, Zhidong Li, Kunqi Li, Chenya Huang, Huacan Wang, Qizhen Lan, Ronghao Chen, Ling Chen, Bin Liang
TL;DR
The paper addresses the lack of real-world validation for world models in remote sensing by proposing direction-conditioned spatial extrapolation as the core task. It introduces RSWISE, a 1,600-task benchmark spanning general, flood, urban, and rural scenarios, scored on distributional fidelity (FID) and on semantic spatial reasoning as judged by GPT-4o. The authors present RemoteBAGEL, a unified multimodal framework trained on action-conditioned data to perform spatial extrapolation with explicit directional control and geographic coherence. Experiments show that RemoteBAGEL achieves state-of-the-art performance on RSWISE and generalizes robustly to out-of-distribution hurricane scenarios, highlighting the potential of geospatial reasoning for Earth observation applications.
Abstract
World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. However, existing approaches are predominantly evaluated in synthetic environments or constrained scene settings, which limits their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications such as disaster response and urban planning urgently require spatial reasoning capabilities. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, in which a model generates a semantically consistent adjacent image tile given a central observation and a directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation, using GPT-4o as a semantic judge, to ensure that models genuinely perform spatial reasoning rather than simply replicating the input. We then present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.
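
To make the task formulation concrete, the following is a minimal sketch of how (central tile, direction, target tile) triples could be cut from a single scene. It is an illustration under stated assumptions, not the authors' data pipeline: the function name `extrapolation_pairs`, the four-direction vocabulary, the convention that "north" means the row above, and the non-overlapping tile grid are all hypothetical choices that the abstract does not specify.

```python
# Hypothetical direction vocabulary: (row offset, column offset) in tile units.
# Assumes row 0 is the northern edge of the scene.
DIRECTIONS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}


def extrapolation_pairs(scene, tile):
    """Yield (center_tile, direction, target_tile) triples from one scene.

    `scene` is a 2-D grid (list of rows); `tile` is the tile side length.
    Each triple pairs a central observation with the adjacent tile that a
    model would be asked to generate for the given direction.
    """
    rows = len(scene) // tile
    cols = len(scene[0]) // tile

    def crop(r, c):
        # Cut the tile at grid position (r, c) out of the scene.
        return [row[c * tile:(c + 1) * tile]
                for row in scene[r * tile:(r + 1) * tile]]

    for r in range(rows):
        for c in range(cols):
            for name, (dr, dc) in DIRECTIONS.items():
                nr, nc = r + dr, c + dc
                # Keep only neighbours that lie inside the scene.
                if 0 <= nr < rows and 0 <= nc < cols:
                    yield crop(r, c), name, crop(nr, nc)


# Toy 4x4 "scene" whose pixel values are just their flat indices.
scene = [[r * 4 + c for c in range(4)] for r in range(4)]
pairs = list(extrapolation_pairs(scene, tile=2))
```

With a tile size of 2, the toy scene forms a 2x2 tile grid and every tile has exactly two in-bounds neighbours, so the sketch yields eight triples. A generated tile would then be compared against the held-out target tile, which is what separates genuine extrapolation from copying the central observation.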
