Safety Analysis in the Era of Large Language Models: A Case Study of STPA using ChatGPT

Yi Qi; Xingyu Zhao; Siddartha Khastgir; Xiaowei Huang

TL;DR

This study investigates whether large language models (LLMs), exemplified by ChatGPT, can support safety analysis through Systems Theoretic Process Analysis (STPA) in safety-critical domains. Using Automatic Emergency Brake (AEB) and Electricity Demand Side Management (DSM) systems as case studies, the authors compare three collaboration schemes between human experts and ChatGPT, vary the semantic complexity of the input, and evaluate common prompt guidelines. The findings indicate that ChatGPT used without human intervention may be inadequate because of reliability issues, but that with a carefully designed workflow it may outperform human experts. Varying the input semantic complexity or applying common prompt guidelines yields no statistically significant differences, which points to the need for domain-specific prompt engineering and standardisation. The work also addresses trustworthiness, regulatory considerations, and challenges posed by rapid LLM updates, proposing a roadmap for safe, scalable adoption of LLM-assisted safety analysis. Overall, the paper contributes a first empirical exploration of LLMs in STPA, outlines effective collaboration patterns, and provides data/code resources to enable broader replication and standardisation efforts in safety-critical contexts.

Abstract

Can safety analysis make use of Large Language Models (LLMs)? This case study explores Systems Theoretic Process Analysis (STPA) applied to Automatic Emergency Brake (AEB) and Electricity Demand Side Management (DSM) systems using ChatGPT. We investigate how collaboration schemes, input semantic complexity, and prompt guidelines influence STPA results. Comparative results show that using ChatGPT without human intervention may be inadequate because of reliability-related issues, but that with careful design it may outperform human experts. No statistically significant differences are found when varying the input semantic complexity or using common prompt guidelines, which suggests the need to develop domain-specific prompt engineering. We also highlight future challenges, including concerns about LLM trustworthiness and the need for standardisation and regulation in this domain.

Paper Structure (44 sections, 3 equations, 6 figures, 6 tables)

Introduction
Motivation
Approach
Key Findings and Contributions
Background
Large Language Models
Systems Theoretic Process Analysis
Methodology
Research Questions
Systems Under Study
Baseline i: Automatic Emergency Brake Systems
AEB Systems
STPA Results by Human Experts
Baseline ii: Electricity Demand Side Management Systems
Electricity DSM Systems
...and 29 more sections

Figures (6)

Figure 1: Four-quadrant classification of risks with ways of mitigations
Figure 2: Three ways of incorporating ChatGPT in the workflow of how human safety experts perform STPA: (a) One-off simplex collaboration (b) Recurring simplex collaboration (c) Recurring duplex collaboration.
Figure 3: Control loop structures of three complexity levels for the two baselines, AEB (first row) and DSM (second row) systems.
Figure 4: (a) The Venn diagram of the sets of UCAs for the AEB system. (b) The Venn diagram of the sets of UCAs for the DSM system. The colours represent the baseline (green), the one-off simplex collaboration case (yellow), the recurring simplex collaboration case (blue) and the recurring duplex collaboration case (orange), respectively. Note: Owing to image size constraints, the UCAs have been truncated to display only key elements. For the complete UCAs, see the appendix.
Figure 5: Box and whisker plots of samples for RQ2
...and 1 more figure
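
The Venn diagrams in Figure 4 compare sets of UCAs across the baseline and the three collaboration schemes. As a minimal sketch of how such regions could be computed, assuming each set is available as a collection of identifiers, the snippet below groups every item by the exact combination of sets that contain it. The function name `venn_regions`, the scheme keys, and the `UCA-n` identifiers are hypothetical placeholders for illustration, not data or code from the paper.

```python
def venn_regions(sets):
    """Group items by the exact combination of named sets that contain them.

    `sets` maps a set name to a set of item identifiers. The result maps a
    frozenset of set names (one Venn region) to the items in that region.
    """
    regions = {}
    all_items = set().union(*sets.values())
    for item in all_items:
        # The region is identified by every set name whose set holds the item.
        members = frozenset(name for name, s in sets.items() if item in s)
        regions.setdefault(members, set()).add(item)
    return regions


# Hypothetical placeholder identifiers, NOT UCAs taken from the paper.
uca_sets = {
    "baseline": {"UCA-1", "UCA-2", "UCA-3"},
    "one_off_simplex": {"UCA-2", "UCA-4"},
    "recurring_simplex": {"UCA-2", "UCA-3", "UCA-5"},
    "recurring_duplex": {"UCA-1", "UCA-2", "UCA-3", "UCA-6"},
}

regions = venn_regions(uca_sets)

# Print regions from the least shared to the most shared.
for members, items in sorted(
    regions.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(members), sorted(items))
```

Each item lands in exactly one region, so the region sizes sum to the size of the union. Items in a region that excludes the baseline would correspond to UCAs proposed only with ChatGPT involved, and items only in the baseline to UCAs found only by the human experts.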

Theorems & Definitions (8)

Remark 1: Accuracy despite discrepancy
Remark 2: Unreliability
Remark 3: Propagation and compounding of errors
Remark 4: Graphical outputs
Remark 5: Unrobustness to question phrased
Remark 6: Precise answers from specific questions
Remark 7: Irreproducibility
Remark 8: Comprehensibility from interactivity
