GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots
Simranjit Singh, Michael Fore, Dimitrios Stamoulis
TL;DR
GeoLLM-Engine addresses the mismatch between simplistic EO benchmarks and real analyst workflows by introducing a realistic, tool-augmented geospatial environment and a scalable, GPT-driven benchmarking engine. It formalizes the environment as $\mathcal{E}=\langle \mathcal{S}, \mathcal{A}, \mathcal{T} \rangle$ with deterministic transitions and a comprehensive tool space, enabling automatic ground-truth verification via a model-checking framework. A self-consistency sampling scheme and 250 instantiated queries across 174 tools drive ground-truth generation, yielding a large-scale benchmark of over 500k tasks on $1.1$ million satellite images. Key findings show that task complexity, rather than dataset size, governs agent performance: ReAct generally outperforms CoT in multi-tool settings, and GPT-4-Turbo delivers the best results. These findings inform the future design of geospatial copilots and benchmarks.
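To make the formalization $\mathcal{E}=\langle \mathcal{S}, \mathcal{A}, \mathcal{T} \rangle$ concrete, the following is a minimal sketch of a deterministic tool environment, assuming a state can be represented as a plain dictionary and an action as a tool name plus its arguments. The tool names (`load_images`, `filter_cloud`), the placeholder values, and the final-state comparison in `is_correct` are hypothetical illustrations. They are not the paper's tool space or its model-checking framework, only one simple way deterministic transitions can support automatic correctness checks.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Action = Tuple[str, Dict[str, object]]  # (tool name, keyword arguments)


# Hypothetical tools: each one maps a state to a new state, deterministically.
def load_images(state: State, region: str) -> State:
    return {**state, "region": region, "images_loaded": True}


def filter_cloud(state: State, max_cover: float) -> State:
    return {**state, "max_cloud_cover": max_cover}


TOOLS: Dict[str, Callable[..., State]] = {
    "load_images": load_images,
    "filter_cloud": filter_cloud,
}


def transition(state: State, action: Action) -> State:
    """Deterministic transition T(s, a): apply one tool call to the state."""
    name, args = action
    return TOOLS[name](state, **args)


def rollout(initial: State, actions: List[Action]) -> State:
    """Apply a sequence of tool calls and return the final state."""
    state = dict(initial)
    for action in actions:
        state = transition(state, action)
    return state


def is_correct(initial: State, agent: List[Action], reference: List[Action]) -> bool:
    """Simplified check: the agent succeeds if it reaches the reference final state."""
    return rollout(initial, agent) == rollout(initial, reference)


reference = [
    ("load_images", {"region": "Seattle"}),
    ("filter_cloud", {"max_cover": 0.2}),
]
# Same tools in a different order: still reaches the same final state.
agent_reordered = [reference[1], reference[0]]
# Missing the cloud filter: ends in a different state.
agent_incomplete = [reference[0]]

print(is_correct({}, agent_reordered, reference))
print(is_correct({}, agent_incomplete, reference))
```

Because transitions are deterministic, replaying a tool sequence always yields the same final state, so a comparison of this kind needs no human judgment. It also accepts any action sequence that reaches the reference state, not just an exact match of the reference tool calls.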
Abstract
Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completion. By alleviating the overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks across 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.
