
Usefulness of data flow diagrams and large language models for security threat validation: a registered report

Winnie Bahati Mbaka, Katja Tuma

TL;DR

The paper tackles the challenge of validating security threats in threat modeling by evaluating how much and what kind of analysis material improves threat validation. It proposes a controlled, partially factorial experiment combining Data Flow Diagrams (DFDs) and Large Language Model (LLM) advice across two realistic scenarios, with ground-truth threats derived from STRIDE and analyzed via Helmert contrasts and nonparametric tests. A pilot with 41 MSc students informs the design and future work, including plans for practitioner studies and a replication package to enable future validation and extension. The findings from the pilot suggest that while LLMs may boost recall and threat identification, they can also raise false positives, and the presence of DFDs alone may not significantly alter performance, indicating nuanced guidance for integrating LLMs and graphical models into threat validation workflows.
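To make the recall versus false-positive trade-off concrete, here is a minimal sketch of how one participant's validation answers could be scored against a set of ground-truth threats. The function name `validation_scores`, the threat identifiers, and the two participants are hypothetical illustrations, not taken from the paper or its replication package.

```python
def validation_scores(marked_real, ground_truth):
    """Score one participant's threat validation against the ground truth.

    marked_real  -- threat IDs the participant judged to be real (hypothetical IDs)
    ground_truth -- threat IDs that are actually real
    """
    marked = set(marked_real)
    truth = set(ground_truth)
    tp = len(marked & truth)   # real threats correctly marked as real
    fp = len(marked - truth)   # bogus threats wrongly marked as real
    fn = len(truth - marked)   # real threats the participant missed
    precision = tp / (tp + fp) if marked else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Illustrative data only: three of the candidate threats are real.
ground_truth = {"T1", "T2", "T3"}
cautious = validation_scores({"T1", "T2"}, ground_truth)
permissive = validation_scores({"T1", "T2", "T3", "T5", "T6"}, ground_truth)
```

In this made-up example the permissive participant reaches full recall but accepts two false positives, while the cautious participant has no false positives but misses one real threat. That is the pattern the pilot suggests LLM advice may produce.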

Abstract

The arrival of recent cybersecurity standards has raised the bar for security assessments in organizations, but existing techniques don't always scale well. Threat analysis and risk assessment are used to identify security threats for new or refactored systems. Still, there is a lack of definition-of-done, so identified threats have to be validated, which slows down the analysis. Existing literature has focused on the overall performance of threat analysis, but no previous work has investigated how deeply analysts must dig into the material before they can effectively validate the identified security threats. We propose a controlled experiment with practitioners to investigate whether some analysis material (like LLM-generated advice) is better than none and whether more material (the system's data flow diagram and LLM-generated advice) is better than some material. In addition, we present key findings from running a pilot with 41 MSc students, which are used to improve the study design. Finally, we also provide an initial replication package, including experimental material and data analysis scripts, and a plan to extend it to include new materials based on the final data collection campaign with practitioners (e.g., pre-screening questions).

Paper Structure (25 sections, 1 equation, 2 tables)