Large Language Models in Code Co-generation for Safe Autonomous Vehicles

Ali Nouri, Beatriz Cabrero-Daniel, Zhennan Fei, Krishna Ronanki, Håkan Sivencrona, Christian Berger

TL;DR

This work investigates the viability of using large language models to co-generate safety-critical automotive software, addressing the risk that stochastic LLM outputs may introduce unsafe behavior. The authors propose a Software-in-the-Loop (SIL) co-generation pipeline with the esmini simulator to quickly evaluate generated code against safety-driven test cases and rank candidates for human review. They systematically compare six LLMs across four automotive functions (F1–F4) with varying difficulty and collect qualitative failure modes to guide safe adoption and model improvement. Key findings show GPT-4 can produce executable code for the most complex function (CAEM) in some runs, while open-source models struggle, highlighting a need for structured verification and human oversight. The work contributes a practical, reusable evaluation framework and a governance approach to integrate LLM-generated code safely into automotive software development.

Abstract

Software engineers in various industrial domains are already using Large Language Models (LLMs) to accelerate the process of implementing parts of software systems. When considering their potential use for ADAS or AD systems in the automotive context, there is a need to systematically assess this new setup: LLMs entail a well-documented set of risks for safety-related systems' development due to their stochastic nature. To reduce the effort required of code reviewers to evaluate LLM-generated code, we propose an evaluation pipeline that conducts sanity checks on the generated code. We compare the performance of six state-of-the-art LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coders, Mistral, and GPT-4) on four safety-related programming tasks. Additionally, we qualitatively analyse the most frequent faults generated by these LLMs, creating a failure-mode catalogue to support human reviewers. Finally, the limitations and capabilities of LLMs in code generation, and the use of the proposed pipeline in the existing process, are discussed.

Paper Structure

This paper contains 16 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Designed and implemented pipeline, including the LLM, prompts, and simulation environment. The function description is automatically inserted into the prompt and sent to the LLM, and the generated Python code is extracted from the LLM's response. The code that compiles is then sent to esmini and tested against the relevant test cases for the specific function. Finally, a report is generated and attached to the code as the output of the pipeline.
  • Figure 2: Performance of all LLMs on two simple functions (F1 and F2). The left bar of each model presents the results for F1 (i.e., brake if the speed is higher than 10 m/s), and the right bar presents the results for F2 (i.e., change lanes until reaching the rightmost lane). The models are ranked first by the total number of successful code generations for F2 and then by those for F1, as F2 is considered more complex than F1.
  • Figure 3: Performance of six models on two advanced automotive functions (F3 and F4). The ACC bar in each group indicates the models' performance in F3, and the CAEM bars present the results for F4. The models are ranked first based on total successful generations for F3 and F4, and then by the number of executable code generations.
  • Figure 4: Pipeline in Fig. 1 integrated into the software engineering process. It enables the pre-evaluation of generated code, with the best candidates selected for engineers to review and improve before proceeding to the rigorous V&V process (right side). Failed code is analysed to extract failure modes, helping refine the prompts. To avoid automation bias and evaluate the effectiveness of the review process, the failed code can be provided to reviewers to check whether it is detected and excluded.
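
The flow described in the Figure 1 caption (extract the Python code from the response, keep only what compiles, test it, attach a report) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the `control(speed)` entry point, the `Report` fields, and the `(speed, expected)` test-case format are hypothetical, and a plain function call stands in for the esmini simulation, which this sketch does not invoke.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
CODE_BLOCK = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response):
    """Return the first fenced code block of an LLM response, or None."""
    match = CODE_BLOCK.search(response)
    return match.group(1) if match else None


def is_compilable(source):
    """Syntax-level sanity check; the code is not executed here."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


@dataclass
class Report:
    """Hypothetical per-candidate report attached to the generated code."""
    candidate: str
    compilable: bool
    passed: int
    total: int


def evaluate(name, response, test_cases):
    """Run one response through extraction, compile check, and stub tests."""
    source = extract_code(response)
    if source is None or not is_compilable(source):
        return Report(name, False, 0, len(test_cases))
    namespace = {}
    passed = 0
    try:
        exec(source, namespace)  # untrusted code: sandbox this in practice
        controller = namespace["control"]  # hypothetical entry point
        for speed, expected in test_cases:  # stand-in for simulator scenarios
            if controller(speed) == expected:
                passed += 1
    except Exception:
        pass  # a runtime failure ends testing; earlier passes are kept
    return Report(name, True, passed, len(test_cases))


def rank(reports):
    """Order candidates for review: compilable first, then most tests passed."""
    return sorted(reports, key=lambda r: (r.compilable, r.passed), reverse=True)
```

For an F1-style rule (brake above 10 m/s), test cases such as `[(5.0, "none"), (10.0, "none"), (12.0, "brake")]` would separate a candidate using `speed > 10` from one using `speed >= 10`, so the latter ranks lower without a reviewer having to spot the boundary error first. The ranking key is only one possible choice; the captions of Figures 2 and 3 describe the ordering the paper actually uses.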