On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software

Ali Nouri, Johan Andersson, Kailash De Jesus Hornig, Zhennan Fei, Emil Knabe, Hakan Sivencrona, Beatriz Cabrero-Daniel, Christian Berger

TL;DR

The paper tackles safe autonomous driving software development by coupling a large language model–based code generator with a simulation-based evaluation loop. A Design Science workflow yields a prototype pipeline that iterates code generation against a minimal world model (esmini) and generates natural language safety assessment reports to guide improvement. Across ACC and CAEM use cases, GPT-4 emerged as the most capable model among those tested, achieving fully functional results in some iterations while open-source models lag behind, underscoring the remaining need for human oversight and robust evaluation. The work demonstrates a model-agnostic, simulation-guided approach that can shorten ADS development cycles while highlighting key limitations, risks, and directions for formal methods and tool qualification for industrial deployment.

Abstract

An Automated Driving System (ADS) is a safety-critical software system responsible for interpreting the vehicle's environment and making decisions accordingly. The unbounded complexity of the driving context, including unforeseeable events, necessitates continuous improvement, often achieved through iterative DevOps processes. However, DevOps processes are themselves complex, making these improvements both time- and resource-intensive. Automating code generation for ADS using Large Language Models (LLMs) is one potential approach to address this challenge. Nevertheless, the development of ADS requires rigorous processes to verify, validate, assess, and qualify the code before it can be deployed and used in the vehicle. In this study, we developed and evaluated a prototype for automatic code generation and assessment using a designed pipeline of an LLM-based agent, a simulation model, and a rule-based feedback generator in an industrial setup. The LLM-generated code is evaluated automatically in a simulation model against multiple critical traffic scenarios, and an assessment report is provided as feedback to the LLM for modification or bug fixing. We report the experimental results of the prototype employing Codellama:34b, DeepSeek (r1:32b and Coder:33b), CodeGemma:7b, Mistral:7b, and GPT-4 for Adaptive Cruise Control (ACC) and Unsupervised Collision Avoidance by Evasive Manoeuvre (CAEM). Finally, we assessed the tool in an interview study with 11 experts at two Original Equipment Manufacturers (OEMs).

Paper Structure

This paper contains 23 sections, 5 figures.

Figures (5)

  • Figure 1: The design and engineering cycles in this study. The research goal is investigated in an industrial setting together with a literature review and, in the final stage, validated by an interview study. Across multiple design cycles, the designed pipeline (treatment) was improved while the generated code, simulations, and test case reports were closely monitored. In the engineering cycle, the final treatment was validated through multiple experiments and evaluated by industrial experts in an interview study.
  • Figure 2: The implementation for iterative automated LLM-based code generation using a simulation model for safety evaluation and for improving the generated code: The pipeline receives the function description and test cases (e.g., TC4 and TC5 depicted on the left side), and safety acceptance criteria. Initially, the pipeline uses a Specification Prompt to generate the first version of the code. This code is then sent to the simulation model, and a test report is generated in natural language. This report is used in the Correction Prompt (template on the right side) to generate subsequent versions of the code based on the initial version. The loop (Simulation-LLM conversation) continues until the code passes all test cases or reaches the maximum iteration limit. The user can also initiate a fresh start of the pipeline, generating a new controller without considering feedback from previous iterations. Each generated version of the controller is compared to the selected baseline using the generated test report.
  • Figure 3: Illustration of the baseline selection strategy across generated code versions over multiple iterations of the proposed feedback mechanism at step $t_n$. The code version $C_{2n-1}$ represents the initial version in each cycle, generated without any feedback. Through refinement via the feedback mechanism, an enhanced version $C_{2n}$ is generated and evaluated automatically. The newly generated version is then compared with the baseline on the number of successful test cases, which serves as a measure of the code’s robustness and safety in handling safety-related scenarios. The pipeline can continue this process both horizontally (i.e., generating new code) and vertically (i.e., refining existing code based on the generated test report) until the code passes all tests (e.g., $C_{26}$), referred to as the gold baseline, or reaches a maturity level (e.g., $C_{3}$) that can be further refined by the engineer. There may also be instances where the generated code is non-executable (e.g., $C_{10}Ne$) for one or more test cases in the simulation due to runtime or syntax errors.
  • Figure 4: Visualization of the progression of generated code versions and their selected baseline at each step. Black arrows show performance improvements through simulation-based feedback to the LLM, while red dashed arrows highlight regressions between initial versions and subsequent refinements. The gold baseline (e.g., $C_{26}$) is the version that passed all test cases successfully.
  • Figure 5: Performance comparison of different LLMs in the designed pipeline for ACC and CAEM. Each model generates 20 code versions for each function. Each version is checked for syntax errors (red dotted bars) and then evaluated on all test cases. If the code passes all test cases, the process is completed (green bars with horizontal lines). If the compiled code fails one or more test cases (orange bars with diagonal lines), it is sent to a correction attempt and tested again to see whether the correction yields correct code (green bars with diagonal lines). The models are ordered from left (lowest performance) to right (highest performance) based on two criteria: (1) the number of code versions that passed all test cases and (2) if no version is successful, the number of compilable versions.
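
The generate, simulate, correct, and compare steps described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `SimReport`, `run_pipeline`, and the three callables `generate`, `correct`, and `simulate` are hypothetical names, and the sketch assumes one correction attempt per cycle (as in Figure 3) and that only executable code that failed a test case is sent for correction (as in Figure 5).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SimReport:
    """Hypothetical summary of one simulation run over all test cases."""
    executable: bool  # False if the code hit a syntax or runtime error
    passed: int       # number of test cases passed
    total: int        # number of test cases run
    text: str         # natural-language report used in the Correction Prompt

    @property
    def all_passed(self) -> bool:
        return self.executable and self.passed == self.total


def run_pipeline(
    generate: Callable[[], str],           # Specification Prompt -> fresh code
    correct: Callable[[str, str], str],    # (code, report text) -> refined code
    simulate: Callable[[str], SimReport],  # code -> test report
    max_cycles: int = 3,
) -> Tuple[Optional[str], Optional[SimReport]]:
    """Return the best code found (the baseline) and its report."""
    baseline_code: Optional[str] = None
    baseline_report: Optional[SimReport] = None

    for _ in range(max_cycles):
        # Fresh start: initial version generated without any feedback.
        code = generate()
        report = simulate(code)
        candidates = [(code, report)]

        # Refinement: feed the test report back for one correction attempt
        # (assumption: non-executable code is not sent for correction).
        if report.executable and not report.all_passed:
            refined = correct(code, report.text)
            candidates.append((refined, simulate(refined)))

        # Baseline selection: keep the executable version that passes
        # the most test cases so far.
        for cand_code, cand_report in candidates:
            if not cand_report.executable:
                continue
            if baseline_report is None or cand_report.passed > baseline_report.passed:
                baseline_code, baseline_report = cand_code, cand_report

        # Stop once a version passes every test case (the gold baseline).
        if baseline_report is not None and baseline_report.all_passed:
            break

    return baseline_code, baseline_report
```

In this sketch a refined version replaces the baseline only when it passes strictly more test cases, so a regression after correction (the red dashed arrows in Figure 4) leaves the earlier baseline in place.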