GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots

Simranjit Singh; Michael Fore; Dimitrios Stamoulis

TL;DR

GeoLLM-Engine addresses the mismatch between simplistic Earth Observation (EO) benchmarks and real analyst workflows by introducing a realistic, tool-augmented geospatial environment and a scalable, GPT-driven benchmarking engine. It formalizes the environment as $\mathcal{E}=\langle \mathcal{S}, \mathcal{A}, \mathcal{T} \rangle$ with deterministic transitions and a comprehensive tool space, enabling automatic ground-truth verification via a model-checking framework. A self-consistency sampling scheme and 250 instantiated queries across 174 tools drive ground-truth generation, yielding a large-scale benchmark of over 500k tasks on 1.1 million satellite images. Key findings show that task complexity, rather than dataset size, governs agent performance: ReAct generally outperforms CoT in multi-tool settings, and GPT-4-Turbo delivers the best results. These findings inform the future design of geospatial copilots and benchmarks.
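To make the formalization above concrete, the following is a minimal sketch, not the paper's implementation. It assumes that a state is a hashable snapshot of a map UI, that an action is a `(tool name, arguments)` pair, and that self-consistency means keeping the final state reached by the majority of sampled tool sequences. The tool names `set_center` and `add_layer`, and all helper names, are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical encodings: a state is a sorted tuple of (key, value) pairs,
# an action is (tool name, argument tuple).
State = Tuple[Tuple[str, object], ...]
Action = Tuple[str, tuple]


@dataclass
class Environment:
    """E = <S, A, T>: tools define the action space, step() is the transition T."""
    tools: Dict[str, Callable[[dict, tuple], dict]]

    def step(self, state: State, action: Action) -> State:
        # Deterministic transition: same state and action always give the same result.
        name, args = action
        new_state = self.tools[name](dict(state), args)
        return tuple(sorted(new_state.items()))

    def rollout(self, state: State, actions: List[Action]) -> State:
        for action in actions:
            state = self.step(state, action)
        return state


def set_center(state: dict, args: tuple) -> dict:
    state["center"] = args  # (lat, lon)
    return state


def add_layer(state: dict, args: tuple) -> dict:
    state["layers"] = state.get("layers", ()) + args
    return state


def majority_final_state(env: Environment, start: State,
                         candidates: List[List[Action]]) -> Tuple[State, int]:
    """Self-consistency (assumed form): keep the most frequent final state."""
    finals = [env.rollout(start, plan) for plan in candidates]
    return Counter(finals).most_common(1)[0]


def is_correct(env: Environment, start: State,
               agent_plan: List[Action], ground_truth: State) -> bool:
    """State-based check: the agent succeeds if it reaches the ground-truth state."""
    return env.rollout(start, agent_plan) == ground_truth


env = Environment(tools={"set_center": set_center, "add_layer": add_layer})
start: State = ()
candidates = [
    [("set_center", (40.7, -74.0)), ("add_layer", ("satellite",))],
    [("add_layer", ("satellite",)), ("set_center", (40.7, -74.0))],
    [("set_center", (40.7, -74.0))],  # incomplete plan, outvoted
]
ground_truth, votes = majority_final_state(env, start, candidates)
```

Because the check compares final states rather than tool-call sequences, two plans that apply the same tools in a different order are both accepted, which is one plausible reason to verify correctness at the state level when transitions are deterministic.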

Abstract

Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completions. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks across 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.

Paper Structure (8 sections, 5 figures, 6 tables)

Introduction
GeoLLM-Engine Environment
GeoLLM-Engine Benchmark Suite
Remote Sensing Datasets
Results
Related Work
Limitations and Future Work
Conclusion

Figures (5)

Figure 1: GeoLLM-Engine offers a realistic web environment equipped with comprehensive APIs for developing, deploying, and evaluating geospatial Copilots on tasks that authentically reflect the workflows of remote sensing analysts. Unlike existing LLM benchmarks which rely on simplified question-answer pairs, GeoLLM-Engine introduces nuanced, long-horizon natural-language commands over a wide range of tasks - from object detection and land classification to document retrieval and web search.
Figure 2: Paper overview: GeoLLM-Engine is a realistic environment for building and evaluating agents on complex EO tasks. Our GPT-driven benchmark engine and the agent-correctness checker minimize the need for human intervention in benchmark curation. This allows us to develop a large-scale benchmark with more than $500{,}000$ long-horizon geospatial tasks on $1{,}100{,}000$ satellite images.
Figure 3: Dataset overview with a wide spatial and temporal range. The satellite imagery serves as task contexts for LLM agents to perform function calls (e.g., given image coordinates, we can assess the agents' ability to fetch the proper set of images around a user-specified location) and is not employed for any LLM-finetuning or other downstream tasks, enabling our research-purposes investigation.
Figure 4: Success Rate vs. Task Complexity (left), Benchmark size (middle), and Error Tolerance (right). All methods use GPT-4-Turbo.
Figure 5: Distribution of different errors in correctness rate for CoT and ReAct methods. All methods use GPT-4
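
The Figure 3 caption mentions assessing whether an agent fetches the proper set of images around a user-specified location. A minimal sketch of such a check is given below, assuming each image is indexed by a single (latitude, longitude) point and that "around" means within a fixed great-circle radius. The function names, the image catalog, and the radius are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Iterable, Set, Tuple


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def images_near(images: Dict[str, Tuple[float, float]],
                lat: float, lon: float, radius_km: float) -> Set[str]:
    """Reference answer: ids of all images within radius_km of the query point."""
    return {
        img_id
        for img_id, (ilat, ilon) in images.items()
        if haversine_km(lat, lon, ilat, ilon) <= radius_km
    }


def fetch_matches(agent_ids: Iterable[str],
                  images: Dict[str, Tuple[float, float]],
                  lat: float, lon: float, radius_km: float) -> bool:
    """The fetch is correct only if the agent returned exactly the reference set."""
    return set(agent_ids) == images_near(images, lat, lon, radius_km)


# Hypothetical catalog: two images near the query point, one far away.
catalog = {
    "a": (40.71, -74.00),
    "b": (40.75, -73.99),
    "c": (34.05, -118.24),
}
expected = images_near(catalog, 40.73, -74.00, radius_km=10.0)
```

An exact set comparison is the strictest choice; a benchmark could instead score partial overlap, which is a design decision this sketch does not take a position on.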
