Safety Analysis in the Era of Large Language Models: A Case Study of STPA using ChatGPT

Yi Qi, Xingyu Zhao, Siddartha Khastgir, Xiaowei Huang

TL;DR

This study investigates whether large language models (LLMs), exemplified by ChatGPT, can support safety analysis performed with Systems Theoretic Process Analysis (STPA) in safety-critical domains. Through case studies of Automatic Emergency Brake (AEB) and Electricity Demand Side Management (DSM) systems, the authors compare three collaboration schemes between humans and ChatGPT, vary the semantic complexity of the input, and evaluate common prompt guidelines. Key findings show that ChatGPT used without human intervention may be inadequate because of reliability issues, but that with a carefully designed collaboration scheme it may outperform human experts. Varying the input semantic complexity or applying common prompt guidelines produced no statistically significant differences, which the authors take as evidence that domain-specific prompt engineering is needed. The work also addresses trustworthiness, regulatory considerations, and challenges posed by rapid LLM updates. Overall, the paper contributes an early empirical exploration of LLMs in STPA, outlines collaboration patterns, and provides data/code resources to enable replication and standardization efforts in safety-critical contexts.

Abstract

Can safety analysis make use of Large Language Models (LLMs)? This case study explores Systems Theoretic Process Analysis (STPA) applied to Automatic Emergency Brake (AEB) and Electricity Demand Side Management (DSM) systems using ChatGPT. We investigate how collaboration schemes, input semantic complexity, and prompt guidelines influence STPA results. Comparative results show that using ChatGPT without human intervention may be inadequate due to reliability-related issues, but with careful design, it may outperform human experts. No statistically significant differences are found when varying the input semantic complexity or using common prompt guidelines, which suggests the necessity for developing domain-specific prompt engineering. We also highlight future challenges, including concerns about LLM trustworthiness and the necessity for standardisation and regulation in this domain.

Paper Structure (44 sections, 3 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Four-quadrant classification of risks with corresponding mitigations
  • Figure 2: Three ways of incorporating ChatGPT into the workflow human safety experts follow to perform STPA: (a) one-off simplex collaboration, (b) recurring simplex collaboration, (c) recurring duplex collaboration.
  • Figure 3: Control loop structures of three complexity levels for the two baselines, AEB (first row) and DSM (second row) systems.
  • Figure 4: (a) Venn diagram of the sets of UCAs for the AEB system. (b) Venn diagram of the sets of UCAs for the DSM system. The colours represent the baseline (green), the one-off simplex collaboration case (yellow), the recurring simplex collaboration case (blue), and the recurring duplex collaboration case (orange). Note: owing to image size constraints, the UCAs have been truncated to display only key elements. For the complete UCAs, see the Appendix.
  • Figure 5: Box and whisker plots of samples for RQ2
  • ...and 1 more figure

Theorems & Definitions (8)

  • Remark 1: Accuracy despite discrepancy
  • Remark 2: Unreliability
  • Remark 3: Propagation and compounding of errors
  • Remark 4: Graphical outputs
  • Remark 5: Unrobustness to question phrased
  • Remark 6: Precise answers from specific questions
  • Remark 7: Irreproducibility
  • Remark 8: Comprehensibility from interactivity