Table of Contents
Fetching ...

Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du

TL;DR

This paper introduces a dedicated, multi-task spatial benchmark for Large Language Models (LLMs) and evaluates a range of models (e.g., gpt-4o, gpt-4-turbo, moonshot-v1-8k, glm-4) through a two-phase process: zero-shot testing and prompt-strategy tuning. It builds a 900-question dataset across 12 spatial task categories by integrating GIS knowledge with Bloom's taxonomy, validated by GIS experts, and uses a novel Weighted Accuracy (WA) metric to quantify performance. The results show gpt-4o achieving the highest zero-shot WA (0.71), with prompt strategies like Chain-of-Thought dramatically boosting performance on challenging tasks (e.g., simple route planning), while some models benefit less or even degrade under certain prompts. The study also highlights the role of dataset difficulty classification and the varying impact of prompting across architectures, providing a rigorous benchmark to guide future model development and targeted prompting for spatial reasoning tasks. $WA = \frac{2 \cdot n(s2) + 1 \cdot n(s1)}{2 \cdot (n(s0) + n(s1) + n(s2))}$ appears as a central evaluation measure, illustrating how partial and full correctness are weighted to reflect task difficulty and model capability.

Abstract

The emergence of large language models such as ChatGPT, Gemini, and others highlights the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been thoroughly assessed. This study addresses this gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach. First, we conducted zero-shot testing. Then, we categorized the dataset by difficulty and performed prompt-tuning tests. Results show that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For instance, the Chain-of-Thought (CoT) strategy increased gpt-4o's accuracy in simple route planning from 12.4% to 87.5%, while a one-shot strategy improved moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.

Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

TL;DR

This paper introduces a dedicated, multi-task spatial benchmark for Large Language Models (LLMs) and evaluates a range of models (e.g., gpt-4o, gpt-4-turbo, moonshot-v1-8k, glm-4) through a two-phase process: zero-shot testing and prompt-strategy tuning. It builds a 900-question dataset across 12 spatial task categories by integrating GIS knowledge with Bloom's taxonomy, validated by GIS experts, and uses a novel Weighted Accuracy (WA) metric to quantify performance. The results show gpt-4o achieving the highest zero-shot WA (0.71), with prompt strategies like Chain-of-Thought dramatically boosting performance on challenging tasks (e.g., simple route planning), while some models benefit less or even degrade under certain prompts. The study also highlights the role of dataset difficulty classification and the varying impact of prompting across architectures, providing a rigorous benchmark to guide future model development and targeted prompting for spatial reasoning tasks. appears as a central evaluation measure, illustrating how partial and full correctness are weighted to reflect task difficulty and model capability.

Abstract

The emergence of large language models such as ChatGPT, Gemini, and others highlights the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been thoroughly assessed. This study addresses this gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach. First, we conducted zero-shot testing. Then, we categorized the dataset by difficulty and performed prompt-tuning tests. Results show that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For instance, the Chain-of-Thought (CoT) strategy increased gpt-4o's accuracy in simple route planning from 12.4% to 87.5%, while a one-shot strategy improved moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
Paper Structure (31 sections, 1 equation, 8 figures, 6 tables)

This paper contains 31 sections, 1 equation, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Construction of task categories in spatial task datasets.
  • Figure 2: Number of questions per category in the dataset (total number of questions: 900).
  • Figure 3: An example of conducting a single round of dialog with the gpt-3.5-turbo model via an API call.
  • Figure 4: Comparison of model results in zero-shot testing.
  • Figure 5: Accuracy of questions clustered according to difficulty. (G3.5t:gpt-3.5-turbo,G4o:gpt-4o,G4t:gpt-4-turbo-2024-0409,Cs:claude-3-sonnet-20240229,Ms:moonshot-v1-8k,Glm:glm-4)
  • ...and 3 more figures