Can Large Language Models Implement Agent-Based Models? An ODD-based Replication Study

Nuno Fachada, Daniel Fernandes, Carlos M. Fernandes, João P. Matos-Carvalho

TL;DR

The paper asks whether large language models can translate a complete ODD (Overview, Design concepts, Details) specification of an agent-based model (ABM) into executable Python while preserving stochastic dynamics and enabling replication. It introduces a controlled experimental pipeline that combines the PPHPC reference ABM, a diverse set of LLMs, and model-independent statistical validation against a NetLogo baseline, complemented by runtime and static code-quality assessments. A subset of hosted models (notably GPT-4.1) produces statistically faithful implementations, but success is inconsistent across models and parameter settings, and validity does not guarantee practical performance or maintainability. The work underscores the promise of LLM-assisted model engineering for reproducible ABMs while highlighting current limitations and the need for rigorous, multi-criterion evaluation in deployment contexts.

Abstract

Large language models (LLMs) can now synthesize non-trivial executable code from textual descriptions, raising an important question: can LLMs reliably implement agent-based models from standardized specifications in a way that supports replication, verification, and validation? We address this question by evaluating 17 contemporary LLMs on a controlled ODD-to-code translation task, using the PPHPC predator-prey model as a fully specified reference. Generated Python implementations are assessed through staged executability checks, model-independent statistical comparison against a validated NetLogo baseline, and quantitative measures of runtime efficiency and maintainability. Results show that behaviorally faithful implementations are achievable but not guaranteed, and that executability alone is insufficient for scientific use. GPT-4.1 consistently produces statistically valid and efficient implementations, with Claude 3.7 Sonnet performing well but less reliably. Overall, the findings clarify both the promise and current limitations of LLMs as model engineering tools, with implications for reproducible agent-based and environmental modelling.

Paper Structure (20 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: The NetLogo baseline implementation of PPHPC. The interface displays controls for initialization, parameterization, and execution (left), time series of population sizes (prey, predators, and available cell-bound food) and mean agent energies over simulation iterations (center-left), alongside a spatial view of the toroidal grid showing agent and resource distributions (right).
  • Figure 2: Execution and validation pipeline for LLM-generated PPHPC implementations, starting from a prompt that includes the ODD description and a trial seed. The LLM response is first checked for the required code/function (score 1 if it is missing), then subjected to a short smoke test (5 iterations; parameter set 1) to detect syntax, runtime/timeout, or output format failures (scores 2--4, respectively). Surviving implementations are executed in 30 stochastic replications of 4000 iterations under each of two parameter sets, and their outputs are statistically compared against the NetLogo baseline; disagreement for either parameter set yields score 5, while statistical indistinguishability for both yields score 6.
  • Figure 3: Distribution of implementation outcome scores (1--6) over six trials (different random seeds) for Python code generated by each LLM when implementing the PPHPC simulation model from its ODD protocol description. Higher scores indicate later stages reached in each trial, with a score of 6 denoting success. Models shown in bold achieved success (score = 6) for at least one trial/seed.
  • Figure 4: Distribution of mean execution times (seconds) for all successful (score 6) LLM-generated PPHPC implementations over trials/seeds, shown separately for parameter sets 1 and 2. Each data point corresponds to the mean runtime $\bar{t}$ of 30 stochastic replications for a single model-trial/seed combination (as reported in Table \ref{tab:times}); thus, the box plots summarize a distribution of means rather than individual run times. The NetLogo baseline is included for reference. The logarithmic $x$-axis highlights variation in computational cost both between models and over successful trials for the same model, indicating whether LLMs tend to produce implementations with highly divergent efficiency.
  • Figure 5: Distribution of code quality metrics for all successful (score 6) LLM-generated PPHPC implementations over trials/seeds, shown as box plots for each model. Panels report (top to bottom) $s_\mathrm{loc}$ (source lines of code), $c_c$ (cyclomatic complexity; lower is better), $m_i$ (maintainability index; 0--100; higher is better), $e_t/100$ (type warnings per 100 $s_\mathrm{loc}$), and $e_F/100$ (flaws and formatting warnings per 100 $s_\mathrm{loc}$ indicating potential code quality issues). This figure complements Table \ref{tab:metrics} by displaying between-trial variability in code structure and static analysis warnings, highlighting whether a given LLM tends to produce consistently similar implementations or highly divergent code among its successful trials.
  • ...and 2 more figures
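
The staged scoring described in the Figure 2 caption can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the authors' code: the `TrialOutcome` record, its field names, the `SMOKE_SCORES` labels, and `score_trial` are all hypothetical, and the boolean inputs stand in for checks (code extraction, smoke test, statistical comparison) that the caption describes only in outline.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TrialOutcome:
    """Hypothetical record of what happened at each stage of one trial."""
    has_required_code: bool                  # response contains the required code/function
    smoke_error: Optional[str] = None        # "syntax", "runtime", "format", or None if the smoke test passed
    matches_baseline: Tuple[bool, bool] = (False, False)  # agreement with the NetLogo baseline, per parameter set


# Smoke-test failure kinds mapped to scores 2-4, in the order the caption lists them
# (syntax, runtime/timeout, output format). The string labels are assumptions.
SMOKE_SCORES = {"syntax": 2, "runtime": 3, "format": 4}


def score_trial(outcome: TrialOutcome) -> int:
    """Return the 1-6 outcome score for a single model/seed trial."""
    if not outcome.has_required_code:
        return 1
    if outcome.smoke_error is not None:
        return SMOKE_SCORES[outcome.smoke_error]
    # Score 6 requires statistical indistinguishability for BOTH parameter sets;
    # disagreement on either one yields score 5.
    return 6 if all(outcome.matches_baseline) else 5
```

Because each stage returns early, a higher score always means a later stage was reached, which matches how Figure 3 interprets the score distribution.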