Evaluating Tool-Augmented Agents in Remote Sensing Platforms

Simranjit Singh; Michael Fore; Dimitrios Stamoulis

TL;DR

GeoLLM-QA addresses the need to evaluate tool-augmented LLMs in remote sensing on realistic, user-grounded tasks rather than template-based QA. It introduces a 1,000-task benchmark executed on a live geospatial UI with a 117-tool toolkit and three RS datasets, using oracle detectors to isolate agent decisions from detector errors. The benchmark is built with a three-step data-generation process (reference templates, LLM-guided perturbations, and ground-truth generation with retrieval augmentation) and is scored with a multi-metric evaluation suite: success rate, correctness, ROUGE-L, token cost, and recall. Key findings: GPT-4 Turbo with function calling delivers strong performance; CoT and ReAct generally outperform Chameleon; ROUGE-L varies with the prompting method; and recall reveals gaps in detection. The work provides a foundation for developing stronger geospatial copilots, and it emphasizes open-sourcing and future directions such as replacing detectors and reducing human-in-the-loop overhead.
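Of the metrics listed above, ROUGE-L is the one with a compact closed form: it scores a generated answer against a reference by the length of their longest common subsequence (LCS). The following is a minimal sketch of the standard LCS-based F-measure, not the benchmark's own implementation. The whitespace tokenization, lowercasing, `beta=1.0` default, and the function names `lcs_length` and `rouge_l` are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming over one row at a time to keep memory at O(len(b)).
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure between two strings (assumed whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    # Hypothetical agent reply and reference answer, for illustration only.
    print(rouge_l("I found 12 airplanes in the selected area",
                  "There are 12 airplanes in the selected area"))
```

Because the score depends only on token overlap in order, two replies that complete the same task can score differently if one is phrased more verbosely, which is consistent with the observation that ROUGE-L varies with the prompting method.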

Abstract

Tool-augmented Large Language Models (LLMs) have shown impressive capabilities in remote sensing (RS) applications. However, existing benchmarks assume question-answering input templates over predefined image-text data pairs. These standalone instructions neglect the intricacies of realistic user-grounded tasks. Consider a geospatial analyst: they zoom into a map area, they draw a region over which to collect satellite imagery, and they succinctly ask "Detect all objects here". Where is "here", if it is not explicitly hardcoded in the image-text template, but instead is implied by the system state, e.g., the live map positioning? To bridge this gap, we present GeoLLM-QA, a benchmark designed to capture long sequences of verbal, visual, and click-based actions on a real UI platform. Through in-depth evaluation of state-of-the-art LLMs over a diverse set of 1,000 tasks, we offer insights towards stronger agents for RS applications.
