Can Large Language Models Implement Agent-Based Models? An ODD-based Replication Study
Nuno Fachada, Daniel Fernandes, Carlos M. Fernandes, João P. Matos-Carvalho
TL;DR
The paper asks whether large language models can translate a complete ODD (Overview, Design concepts, Details) specification of an agent-based model (ABM) into executable Python while preserving the model's stochastic dynamics and supporting replication. It introduces a controlled experimental pipeline that combines the PPHPC reference ABM, a diverse set of LLMs, and model-independent statistical validation against a NetLogo baseline, complemented by runtime and static code-quality assessments. A subset of hosted models (notably GPT-4.1) produces statistically faithful implementations, but success is inconsistent across models and parameter settings, and statistical validity does not guarantee practical performance or maintainability. The work underscores the promise of LLM-assisted model engineering for reproducible ABMs while highlighting current limitations and the need for rigorous, multi-criterion evaluation in deployment contexts.
Abstract
Large language models (LLMs) can now synthesize non-trivial executable code from textual descriptions, raising an important question: can LLMs reliably implement agent-based models from standardized specifications in a way that supports replication, verification, and validation? We address this question by evaluating 17 contemporary LLMs on a controlled ODD-to-code translation task, using the PPHPC predator-prey model as a fully specified reference. Generated Python implementations are assessed through staged executability checks, model-independent statistical comparison against a validated NetLogo baseline, and quantitative measures of runtime efficiency and maintainability. Results show that behaviorally faithful implementations are achievable but not guaranteed, and that executability alone is insufficient for scientific use. GPT-4.1 consistently produces statistically valid and efficient implementations, with Claude 3.7 Sonnet performing well but less reliably. Overall, the findings clarify both the promise and current limitations of LLMs as model engineering tools, with implications for reproducible agent-based and environmental modelling.
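To make the idea of a model-independent statistical comparison concrete, the following is a minimal sketch that compares one summary statistic collected from replicate runs of two implementations. It is an illustration under stated assumptions, not the procedure used in the study: the abstract does not name the test, so a two-sided permutation test on the difference of means is swapped in here, and the function name `permutation_test`, the replicate counts, and the synthetic data are all hypothetical.

```python
import random


def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of sample means.

    `a` and `b` hold one summary statistic per replicate run (for example,
    a steady-state population mean) from two implementations of the same
    model. Returns an estimated p-value; small values suggest the two
    implementations produce different output distributions.
    """
    rng = random.Random(seed)
    n_a, n_b = len(a), len(b)
    observed = abs(sum(a) / n_a - sum(b) / n_b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign replicates to the two groups.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic stand-ins for replicate outputs (assumed values, not study data).
    data_rng = random.Random(42)
    baseline = [data_rng.gauss(1000.0, 25.0) for _ in range(30)]
    candidate = [data_rng.gauss(1000.0, 25.0) for _ in range(30)]
    print("p-value:", permutation_test(baseline, candidate))
```

In a workflow of this kind, a generated implementation would only reach this comparison after it runs to completion, and the same comparison would be repeated for each output of interest and each parameter setting rather than for a single statistic.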
