ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks

Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, Salman Khan

TL;DR

ThinkGeo addresses the need for domain-specific benchmarks to evaluate tool-augmented LLMs in remote sensing (RS). It introduces an RS benchmark with 486 multi-step tasks, 14 tools, optical and synthetic aperture radar (SAR) imagery, and ReAct-style traces, enabling both end-to-end and step-wise evaluation. Empirical results across multiple models reveal substantial gaps in tool-use fidelity and geospatial grounding, even for leading models, especially on SAR data. The work provides a foundation and methodology for advancing spatially aware, tool-enabled LLMs in geospatial workflows.

Abstract

Recent progress in large language models (LLMs) has enabled tool-augmented agents capable of solving complex real-world tasks through step-by-step reasoning. However, existing evaluations often focus on general-purpose or multimodal scenarios, leaving a gap in domain-specific benchmarks that assess tool-use capabilities in complex remote sensing use cases. We present ThinkGeo, an agentic benchmark designed to evaluate LLM-driven agents on remote sensing tasks via structured tool use and multi-step planning. Inspired by tool-interaction paradigms, ThinkGeo includes human-curated queries spanning a wide range of real-world applications such as urban planning, disaster assessment and change analysis, environmental monitoring, transportation analysis, aviation monitoring, recreational infrastructure, and industrial site analysis. Queries are grounded in satellite or aerial imagery, including both optical RGB and SAR data, and require agents to reason through a diverse toolset. We implement a ReAct-style interaction loop and evaluate both open- and closed-source LLMs (e.g., GPT-4o, Qwen2.5) on 486 structured agentic tasks with 1,773 expert-verified reasoning steps. The benchmark reports both step-wise execution metrics and final answer correctness. Our analysis reveals notable disparities in tool accuracy and planning consistency across models. ThinkGeo provides the first extensive testbed for evaluating how tool-enabled LLMs handle spatial reasoning in remote sensing.
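
To make the ReAct-style interaction loop mentioned in the abstract more concrete, the following is a minimal Python sketch of such a loop. It is not the ThinkGeo implementation: the `Step` record, the `run_react` function, the `count_objects` tool, and the scripted policy standing in for an LLM are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One agent turn: a thought plus either a tool call or a final answer."""
    thought: str
    tool: Optional[str] = None          # None means the agent is answering
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


History = List[Tuple[Step, object]]     # (step, observation) pairs


def run_react(policy: Callable[[str, History], Step],
              tools: Dict[str, Callable],
              query: str,
              max_steps: int = 8) -> Tuple[Optional[str], History]:
    """Alternate thought -> tool call -> observation until an answer is given."""
    history: History = []
    for _ in range(max_steps):
        step = policy(query, history)
        if step.tool is None:
            return step.answer, history
        if step.tool not in tools:
            # Feed the failure back as an observation so the agent can recover.
            observation = f"error: unknown tool {step.tool!r}"
        else:
            try:
                observation = tools[step.tool](**step.args)
            except Exception as exc:
                observation = f"error: {exc}"
        history.append((step, observation))
    return None, history                # step budget exhausted, no answer


# Hypothetical stand-ins for an LLM policy and a remote sensing tool.
def scripted_policy(query: str, history: History) -> Step:
    if not history:
        return Step("Count the objects first.",
                    tool="count_objects", args={"label": "airplane"})
    _, observation = history[-1]
    return Step("I have the count.", answer=f"{observation} airplanes")


toy_tools = {"count_objects": lambda label: 3}
answer, trace = run_react(scripted_policy, toy_tools,
                          "How many airplanes are visible?")
```

Keeping every `(step, observation)` pair in `history` is what would allow both kinds of scoring described above: the final `answer` can be checked for correctness, and the recorded steps can be compared one by one against a reference trace.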

Paper Structure

This paper contains 18 sections, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: Representative samples from the ThinkGeo benchmark. Each row illustrates a user query grounded in real RS imagery, followed by a ReAct-based execution chain comprising tool calls and reasoning steps, and concludes with the resulting answer. The examples span diverse use cases, including transportation analysis, urban planning, disaster assessment and change analysis, recreational infrastructure, and environmental monitoring, highlighting multi-tool reasoning and spatial task complexity.
  • Figure 2: Overview of task domains and reasoning-tool interaction characteristics in ThinkGeo.
  • Figure 3: End-to-end dataflow for constructing the ThinkGeo benchmark. We begin with expert-curated samples from remote sensing datasets, guided by scenario-specific query design templates. Human annotators inspect images and generate ReAct-style multi-step queries using a semi-automated GPT-powered interface. Each query is validated via expert review and script-based consistency checks. Invalid cases are manually refined. The final dataset consists of JSON-formatted ReAct traces grounded in satellite or aerial imagery.
  • Figure 4: Number of correctly answered queries per model, categorized by difficulty level.
  • Figure 5: Total number of tool calls made by each model and the corresponding number of tool call errors. The large discrepancy in open-source models indicates a high rate of tool misuse. In contrast, models like GPT-4o demonstrate better tool invocation reliability.
  • ...and 6 more figures
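
The Figure 5 caption compares total tool calls with tool call errors per model. As a rough illustration of how such counts could be derived from a recorded trace, here is a small Python sketch. The trace schema (a list of dicts with `tool` and `error` keys) and the `tool_call_stats` helper are assumptions made for this example, not the benchmark's actual JSON format or evaluation script.

```python
def tool_call_stats(trace):
    """Count tool calls and failed tool calls in one trace.

    `trace` is assumed to be a list of dicts; a step counts as a tool call
    when its "tool" field is set, and as an error when "error" is truthy.
    """
    calls = [step for step in trace if step.get("tool") is not None]
    errors = [step for step in calls if step.get("error")]
    rate = len(errors) / len(calls) if calls else 0.0
    return {"calls": len(calls), "errors": len(errors), "error_rate": rate}


# Hypothetical trace: three tool calls (one failing) and a final answer step.
example_trace = [
    {"tool": "detect_objects", "error": None},
    {"tool": "measure_distance", "error": "missing argument"},
    {"tool": "count_objects", "error": None},
    {"tool": None, "answer": "3 airplanes"},
]
stats = tool_call_stats(example_trace)
```

Summing `calls` and `errors` over all traces for a model would give the two quantities that the caption describes.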