Safety Analysis in the Era of Large Language Models: A Case Study of STPA using ChatGPT
Yi Qi, Xingyu Zhao, Siddartha Khastgir, Xiaowei Huang
TL;DR
This study investigates whether large language models (LLMs), exemplified by ChatGPT, can support safety analysis through Systems Theoretic Process Analysis (STPA) in safety-critical domains. Through case studies of Automatic Emergency Brake (AEB) and Electricity Demand Side Management (DSM) systems, the authors compare three collaboration schemes between humans and ChatGPT, assess the effect of input semantic complexity, and evaluate prompt guidelines. Key findings show that ChatGPT used without human intervention may be inadequate because of reliability issues, but with careful design it may outperform human experts. Varying the input semantic complexity or applying common prompt guidelines produced no statistically significant differences, which points to the need for domain-specific prompt engineering. The work also addresses trustworthiness, regulatory considerations, and challenges posed by rapid LLM updates, proposing a roadmap for safe, scalable adoption of LLM-assisted safety analysis. Overall, the paper contributes a first empirical exploration of LLMs in STPA, outlines effective collaboration patterns, and provides data and code resources to enable broader replication and standardisation efforts in safety-critical contexts.
Abstract
Can safety analysis make use of Large Language Models (LLMs)? A case study explores Systems Theoretic Process Analysis (STPA) applied to Automatic Emergency Brake (AEB) and Electricity Demand Side Management (DSM) systems using ChatGPT. We investigate how collaboration schemes, input semantic complexity, and prompt guidelines influence STPA results. Comparative results show that using ChatGPT without human intervention may be inadequate due to reliability-related issues, but with careful design, it may outperform human experts. No statistically significant differences are found when varying the input semantic complexity or using common prompt guidelines, which suggests the need to develop domain-specific prompt engineering. We also highlight future challenges, including concerns about LLM trustworthiness and the necessity for standardisation and regulation in this domain.
