Evaluation of Code LLMs on Geospatial Code Generation

Piotr Gramacki, Bruno Martins, Piotr Szymański

TL;DR

The paper addresses the challenge of geospatial code generation by introducing an extensible benchmark that categorizes geospatial problems along four dimensions: task complexity, input type, tool usage, and task framing. It provides a dataset of 20 unique tasks (77 samples after augmentation) and an automated evaluation pipeline to measure functional correctness across multiple code LLMs. Through experiments with seven models, the study reveals substantial variability in performance across dimensions, with multi-step tasks and tool coverage posing the greatest challenges, and highlights that models trained on general code may struggle with domain-specific libraries like OSMnx and MovingPandas. The work offers a reproducible benchmark and dataset to spur development of geospatial-focused coding assistants, and outlines clear directions for expanding the benchmark and training specialized models for geospatial data science tasks.

Abstract

Software development support tools have been studied for a long time, with recent approaches using Large Language Models (LLMs) for code generation. These models can generate Python code for data science and machine learning applications. LLMs are helpful for software engineers because they increase productivity in daily work. An LLM can also serve as a "mentor" for inexperienced software developers and provide viable learning support. High-quality code generation with LLMs can also be beneficial in geospatial data science. However, this domain poses different challenges, and code generation LLMs are typically not evaluated on geospatial tasks. Here, we show how we constructed an evaluation benchmark for code generation models, based on a selection of geospatial tasks. We categorised geospatial tasks based on their complexity and required tools. Then, we created a dataset with tasks that test model capabilities in spatial reasoning, spatial data processing, and geospatial tool usage. The dataset consists of specific coding problems that were manually created for high quality. For every problem, we proposed a set of test scenarios that make it possible to automatically check the generated code for correctness. In addition, we tested a selection of existing code generation LLMs in the geospatial domain. We share our dataset and reproducible evaluation code in a public GitHub repository, arguing that this can serve as an evaluation benchmark for new LLMs in the future. Our dataset will hopefully contribute to the development of new models capable of solving geospatial coding tasks with high accuracy. These models will enable the creation of coding assistants tailored for geospatial applications.

Paper Structure (32 sections, 1 figure, 7 tables)