Large Language Models in Code Co-generation for Safe Autonomous Vehicles

Ali Nouri; Beatriz Cabrero-Daniel; Zhennan Fei; Krishna Ronanki; Håkan Sivencrona; Christian Berger

TL;DR

This work investigates the viability of using large language models to co-generate safety-critical automotive software, addressing the risk that stochastic LLM outputs may introduce unsafe behavior. The authors propose a Software-in-the-Loop (SIL) co-generation pipeline with the esmini simulator to quickly evaluate generated code against safety-driven test cases and rank candidates for human review. They systematically compare six LLMs across four automotive functions (F1–F4) with varying difficulty and collect qualitative failure modes to guide safe adoption and model improvement. Key findings show GPT-4 can produce executable code for the most complex function (CAEM) in some runs, while open-source models struggle, highlighting a need for structured verification and human oversight. The work contributes a practical, reusable evaluation framework and a governance approach to integrate LLM-generated code safely into automotive software development.

Abstract

Software engineers in various industrial domains are already using Large Language Models (LLMs) to accelerate the implementation of parts of software systems. Their potential use for Advanced Driver Assistance Systems (ADAS) or Autonomous Driving (AD) systems in the automotive context calls for a systematic assessment of this new setup: because of their stochastic nature, LLMs entail a well-documented set of risks for the development of safety-related systems. To reduce the effort code reviewers spend evaluating LLM-generated code, we propose an evaluation pipeline that conducts sanity checks on the generated code. We compare the performance of six state-of-the-art LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coders, Mistral, and GPT-4) on four safety-related programming tasks. Additionally, we qualitatively analyse the most frequent faults generated by these LLMs, creating a failure-mode catalogue to support human reviewers. Finally, we discuss the limitations and capabilities of LLMs in code generation, and how the proposed pipeline fits into the existing development process.
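As a rough illustration of the sanity-check idea described above, the following minimal Python sketch runs several candidate implementations against a small set of test cases and orders them for human review. It is not the authors' pipeline and makes no simulator calls: the candidate functions, the two toy braking tests, and the names `evaluate` and `rank` are all hypothetical stand-ins, and the ranking rule (most tests passed first, fewest crashes second) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: a candidate maps (gap in m, speed in m/s) to a
# braking demand in m/s^2. In a real setup this would be generated code
# exercised in a simulator rather than a plain Python function.
Controller = Callable[[float, float], float]
TestCase = Callable[[Controller], bool]


@dataclass
class Result:
    name: str
    passed: int
    crashed: int
    total: int


def evaluate(name: str, candidate: Controller, tests: List[TestCase]) -> Result:
    """Run every test; a raised exception counts as a crash, not a pass."""
    passed = crashed = 0
    for test in tests:
        try:
            if test(candidate):
                passed += 1
        except Exception:
            crashed += 1
    return Result(name, passed, crashed, len(tests))


def rank(candidates: Dict[str, Controller], tests: List[TestCase]) -> List[Result]:
    """Order candidates for review: most passes first, then fewest crashes."""
    results = [evaluate(n, c, tests) for n, c in candidates.items()]
    return sorted(results, key=lambda r: (-r.passed, r.crashed, r.name))


# Toy test cases (assumed for illustration only).
TESTS: List[TestCase] = [
    lambda c: c(10.0, 20.0) > 0.0,    # short gap: must brake
    lambda c: c(100.0, 20.0) == 0.0,  # long gap: must not brake
]

# Toy candidates standing in for LLM-generated code.
CANDIDATES: Dict[str, Controller] = {
    "never_brakes": lambda gap, speed: 0.0,
    "broken": lambda gap, speed: 1.0 / 0.0,  # raises ZeroDivisionError
    "safe": lambda gap, speed: 5.0 if gap < 2.0 * speed else 0.0,
}

if __name__ == "__main__":
    for r in rank(CANDIDATES, TESTS):
        print(f"{r.name}: {r.passed}/{r.total} passed, {r.crashed} crashed")
```

A ranking like this only prioritises candidates for a reviewer; passing the tests does not by itself show that the code is safe.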
