On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software
Ali Nouri, Johan Andersson, Kailash De Jesus Hornig, Zhennan Fei, Emil Knabe, Hakan Sivencrona, Beatriz Cabrero-Daniel, Christian Berger
TL;DR
The paper tackles safe autonomous driving software development by coupling a large language model–based code generator with a simulation-based evaluation loop. A Design Science workflow yields a prototype pipeline that iterates code generation against a minimal world model (esmini) and generates natural language safety assessment reports to guide improvement. Across ACC and CAEM use cases, GPT-4 emerged as the most capable model among those tested, achieving fully functional results in some iterations while open-source models lag behind, underscoring the remaining need for human oversight and robust evaluation. The work demonstrates a model-agnostic, simulation-guided approach that can shorten ADS development cycles while highlighting key limitations, risks, and directions for formal methods and tool qualification for industrial deployment.
Abstract
An Automated Driving System (ADS) is a safety-critical software system responsible for interpreting the vehicle's environment and making decisions accordingly. The unbounded complexity of the driving context, including unforeseeable events, necessitates continuous improvement, often achieved through iterative DevOps processes. However, DevOps processes are themselves complex, making these improvements both time- and resource-intensive. Automating code generation for ADS using Large Language Models (LLMs) is one potential approach to address this challenge. Nevertheless, the development of ADS requires rigorous processes to verify, validate, assess, and qualify the code before it can be deployed in the vehicle and used. In this study, we developed and evaluated a prototype for automatic code generation and assessment using a pipeline consisting of an LLM-based agent, a simulation model, and a rule-based feedback generator in an industrial setup. The LLM-generated code is evaluated automatically in a simulation model against multiple critical traffic scenarios, and an assessment report is provided as feedback to the LLM for modification or bug fixing. We report the experimental results of the prototype employing Codellama:34b, DeepSeek (r1:32b and Coder:33b), CodeGemma:7b, Mistral:7b, and GPT-4 for Adaptive Cruise Control (ACC) and Unsupervised Collision Avoidance by Evasive Manoeuvre (CAEM). Finally, we assessed the tool with 11 experts at two Original Equipment Manufacturers (OEMs) by conducting an interview study.
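The generate, simulate, and feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`ScenarioResult`, `build_report`, `refine`, `fake_generate`, `fake_simulate`) is hypothetical, the generator stands in for an LLM call, and the simulator stub stands in for a real simulation run. The rule-based report is reduced to a list of failed scenarios, which is an assumption about its shape.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ScenarioResult:
    """Outcome of running one candidate program in one traffic scenario."""
    scenario: str
    passed: bool
    detail: str


def build_report(results: List[ScenarioResult]) -> str:
    """Rule-based feedback: summarise failed scenarios as plain text."""
    failures = [r for r in results if not r.passed]
    if not failures:
        return "All scenarios passed."
    lines = [f"{len(failures)} of {len(results)} scenarios failed:"]
    lines += [f"- {r.scenario}: {r.detail}" for r in failures]
    return "\n".join(lines)


def refine(
    generate: Callable[[Optional[str]], str],
    simulate: Callable[[str, str], ScenarioResult],
    scenarios: List[str],
    max_iterations: int = 5,
) -> Tuple[str, int, bool]:
    """Regenerate code until every scenario passes or the budget runs out.

    Returns (last code, iterations used, whether all scenarios passed).
    """
    feedback: Optional[str] = None
    code = ""
    used = 0
    for used in range(1, max_iterations + 1):
        code = generate(feedback)                     # LLM call (stubbed here)
        results = [simulate(code, s) for s in scenarios]
        if all(r.passed for r in results):
            return code, used, True
        feedback = build_report(results)              # report goes back to the LLM
    return code, used, False


# Hypothetical stand-ins so the sketch runs without an LLM or a simulator.
def fake_generate(feedback: Optional[str]) -> str:
    return "v1" if feedback is None else "v2"


def fake_simulate(code: str, scenario: str) -> ScenarioResult:
    ok = code == "v2" or scenario != "cut-in"
    return ScenarioResult(scenario, ok, "" if ok else "collision with lead vehicle")


if __name__ == "__main__":
    print(refine(fake_generate, fake_simulate, ["follow", "cut-in"]))
```

With the stubs above, the first candidate fails the `cut-in` scenario, the report is passed back to the generator, and the second candidate passes. A loop of this kind would still leave verification, validation, and qualification of the final code to the processes the abstract mentions.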
