Evaluating Tool-Augmented Agents in Remote Sensing Platforms
Simranjit Singh, Michael Fore, Dimitrios Stamoulis
TL;DR
GeoLLM-QA addresses the need for evaluating tool-augmented LLMs in remote sensing under realistic, user-grounded tasks rather than template-based QA. It introduces a 1,000-task benchmark executed on a live geospatial UI with a 117-tool toolkit and three RS datasets, using oracle detectors to isolate agent decisions from detector errors. The benchmark uses a three-step data-generation process (reference templates, LLM-guided perturbations, and ground-truth generation with retrieval augmentation) and a multi-metric evaluation suite including success rate, correctness, ROUGE-L, token cost, and recall. Key findings show GPT-4 Turbo with function-calling delivers strong performance, CoT and ReAct generally outperform Chameleon, ROUGE-L varies with prompting, and recall reveals gaps in detection; the work provides a foundation for developing stronger geospatial copilots and emphasizes open-sourcing and future directions such as replacing detectors and reducing human-in-the-loop overhead.
Abstract
Tool-augmented Large Language Models (LLMs) have shown impressive capabilities in remote sensing (RS) applications. However, existing benchmarks assume question-answering input templates over predefined image-text data pairs. These standalone instructions neglect the intricacies of realistic user-grounded tasks. Consider a geospatial analyst: they zoom into a map area, they draw a region over which to collect satellite imagery, and they succinctly ask "Detect all objects here". Where is "here", if it is not explicitly hardcoded in the image-text template, but is instead implied by the system state, e.g., the live map positioning? To bridge this gap, we present GeoLLM-QA, a benchmark designed to capture long sequences of verbal, visual, and click-based actions on a real UI platform. Through in-depth evaluation of state-of-the-art LLMs over a diverse set of 1,000 tasks, we offer insights towards stronger agents for RS applications.
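For readers unfamiliar with ROUGE-L, one of the metrics named in the TL;DR, the following is a minimal sketch of how such a score can be computed from the longest common subsequence (LCS) between an agent's reply and a reference reply. It is not the benchmark's own scoring code: the lowercase whitespace tokenization, the choice of a balanced F1 score, the function names, and the example strings are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as a balanced F1 over LCS-based precision and recall.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the reply that matches the reference
    recall = lcs / len(ref)      # share of the reference covered by the reply
    return 2 * precision * recall / (precision + recall)


# Hypothetical agent reply and reference reply, for illustration only.
score = rouge_l_f1(
    "Found 3 airplanes in the region",
    "There are 3 airplanes in the selected region",
)
print(round(score, 3))
```

Because the score depends only on word overlap in order, two replies that report the same result in different phrasing can score differently, which is consistent with the observation that ROUGE-L varies with prompting.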
