Remote Sensing-Oriented World Model

Yuxi Lu, Biao Wu, Zhidong Li, Kunqi Li, Chenya Huang, Huacan Wang, Qizhen Lan, Ronghao Chen, Ling Chen, Bin Liang

TL;DR

The paper addresses the lack of real-world validation for world models in remote sensing by proposing direction-conditioned spatial extrapolation as the core task. It introduces RSWISE, a 1,600-task benchmark across general, flood, urban, and rural scenarios, evaluated with distributional fidelity (FID) and semantic spatial reasoning (GPT-4o). The authors present RemoteBAGEL, a unified multimodal framework trained on action-conditioned data to perform spatial extrapolation with explicit directional control and geographic coherence. Experiments show RemoteBAGEL achieving state-of-the-art performance on RSWISE, including robust generalization to out-of-distribution hurricane scenarios, highlighting the potential for geospatial reasoning in Earth observation applications.

Abstract

World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. However, existing approaches are predominantly evaluated in synthetic environments or constrained scene settings, limiting their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications urgently require spatial reasoning capabilities for disaster response and urban planning. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and a directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT-4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. We then present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.

Paper Structure

This paper contains 62 sections, 6 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Illustration of direction-conditioned spatial extrapolation for Remote Sensing World Modeling. Given a central observation $x_t$ (the current spatial tile) and a directional instruction $a_t$, the world model learns the underlying geospatial structure and predicts the adjacent, previously unobserved tile $x_{t+1}$. Here, the index $t$ denotes spatial progression rather than temporal evolution, aligning our task with the next-state prediction paradigm used in World Models.
  • Figure 2: Overview of the RSWISE evaluation framework. RSWISE assesses spatial extrapolation quality along three axes: continuity fidelity, semantic transitions, and directional consistency. These axes are jointly operationalized through two complementary metrics: GPT-4o for spatial reasoning and FID for distributional fidelity. The five examples illustrate how models differ across these axes: some achieve low FID yet fail to produce meaningful directional content, while others show strong spatial reasoning and continuity but incorrect directional consistency. RSWISE integrates both aspects to provide a balanced and geospatially grounded assessment of world modeling performance.
  • Figure 3: Each start tile is paired with its four cardinal neighbors, yielding evaluation triplets.
  • Figure 4: Illustration of the RemoteBAGEL formulation. (a) Large satellite images are partitioned into overlapping $3\times3$ grids, with example trajectories providing consecutive steps of supervision. (b) Given a central tile and a directional instruction (up, down, left, right), the adjacent tile in the specified direction serves as ground truth, yielding instruction-conditioned triplets. (c) The architecture encodes the input tile and instruction embedding, fuses them via attention, and decodes a continuation tile consistent with the specified direction.
  • Figure 5: Qualitative comparison of rightward continuations across four scenarios (general, flood, urban, rural). RemoteBAGEL produces geospatially consistent extrapolations aligned with the ground truth, whereas other models often generate invalid or semantically inconsistent content.
  • ...and 9 more figures
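
The triplet construction described in the captions for Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_triplets`, the `OFFSETS` table, the string tile identifiers, and the convention that row 0 is the top of the image are all hypothetical. The sketch also emits triplets for edge tiles that have fewer than four neighbors; the captions do not say whether the paper keeps those.

```python
from typing import Dict, List, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical (row, column) offsets for each directional instruction.
# Assumption: row 0 is the top of the image, so "up" decreases the row index.
OFFSETS: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def build_triplets(grid: List[List[T]]) -> List[Tuple[T, str, T]]:
    """Pair every tile with each in-bounds cardinal neighbor.

    Returns (source_tile, direction, target_tile) triplets, where the
    target tile plays the role of the ground-truth continuation.
    """
    triplets: List[Tuple[T, str, T]] = []
    n_rows = len(grid)
    for r, row in enumerate(grid):
        for c, tile in enumerate(row):
            for direction, (dr, dc) in OFFSETS.items():
                nr, nc = r + dr, c + dc
                # Skip neighbors that fall outside the grid.
                if 0 <= nr < n_rows and 0 <= nc < len(grid[nr]):
                    triplets.append((tile, direction, grid[nr][nc]))
    return triplets


# A 3x3 grid of placeholder tile identifiers (real tiles would be images).
grid = [
    ["t00", "t01", "t02"],
    ["t10", "t11", "t12"],
    ["t20", "t21", "t22"],
]
triplets = build_triplets(grid)
center_triplets = [t for t in triplets if t[0] == "t11"]
```

In a 3x3 grid only the central tile has all four neighbors, so `center_triplets` holds exactly one triplet per direction. Filtering to such tiles would match the four-neighbor pairing that Figure 3 describes.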