Does Johnny Get the Message? Evaluating Cybersecurity Notifications for Everyday Users

Victor Jüttner, Erik Buchmann

TL;DR

This work addresses the challenge of making cybersecurity alerts from everyday devices understandable to non-experts by introducing HCSAEF, a seven-dimension evaluation framework for LLM-generated notifications. The framework uses a 5-point Likert scale to assess Consequences, Context, Countermeasures, Correctness, Intuitiveness, Personalization, and Urgency, enabling systematic comparisons across prompts and LLMs. Through case studies on IDS alerts, the authors demonstrate how prompt design, model choice, and output robustness influence notification quality, providing actionable guidance for researchers and practitioners designing usable security communications. They also outline a path toward automation (LLM-as-a-judge) and large-scale evaluation, highlighting the framework's potential to improve security awareness and response in smart-home contexts.

Abstract

Due to the increasing presence of networked devices in everyday life, not only cybersecurity specialists but also end users benefit from security applications such as firewalls, vulnerability scanners, and intrusion detection systems. Recent approaches use large language models (LLMs) to rewrite brief, technical security alerts into intuitive language and suggest actionable measures, helping everyday users understand and respond appropriately to security risks. However, it remains an open question how well such alerts are explained to users. LLM outputs can also be hallucinated, inconsistent, or misleading. In this work, we introduce the Human-Centered Security Alert Evaluation Framework (HCSAEF). HCSAEF assesses LLM-generated cybersecurity notifications to support researchers who want to compare notifications generated for everyday users, improve them, or analyze the capabilities of different LLMs in explaining cybersecurity issues. We demonstrate HCSAEF through three use cases, which allow us to quantify the impact of prompt design, model selection, and output consistency. Our findings indicate that HCSAEF effectively differentiates generated notifications along dimensions such as intuitiveness, urgency, and correctness.
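The comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes each notification receives one integer score from 1 to 5 per dimension, and the function names (`validate`, `compare`, `consistency`) as well as the use of per-dimension differences and standard deviation are assumptions introduced here.

```python
from statistics import pstdev

# The seven HCSAEF dimensions as named in the summary.
DIMENSIONS = (
    "Consequences", "Context", "Countermeasures", "Correctness",
    "Intuitiveness", "Personalization", "Urgency",
)


def validate(rating):
    """Check that a rating scores exactly the seven dimensions on a 1-5 scale."""
    if set(rating) != set(DIMENSIONS):
        raise ValueError("rating must score exactly the seven dimensions")
    for dim, score in rating.items():
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dim}: expected an integer from 1 to 5, got {score!r}")
    return rating


def compare(rating_a, rating_b):
    """Per-dimension difference (b minus a), e.g. Prompt 2 versus Prompt 1."""
    validate(rating_a)
    validate(rating_b)
    return {dim: rating_b[dim] - rating_a[dim] for dim in DIMENSIONS}


def consistency(ratings):
    """Per-dimension population standard deviation across repeated runs
    of the same prompt (0.0 means every run got the same score)."""
    for rating in ratings:
        validate(rating)
    return {dim: pstdev([r[dim] for r in ratings]) for dim in DIMENSIONS}


# Hypothetical scores, made up for illustration only (not results from the paper).
prompt_1 = {dim: 3 for dim in DIMENSIONS}
prompt_2 = {**prompt_1, "Intuitiveness": 5, "Urgency": 4}
delta = compare(prompt_1, prompt_2)
```

Keeping the scores per dimension rather than collapsing them into one number mirrors the way the framework is described: two notifications can differ in intuitiveness or urgency while being equally correct.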

Paper Structure

This paper contains 20 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Example of a cybersecurity alert rewritten by GPT-4o into a detailed, user-friendly notification tailored for non-expert homeowners.
  • Figure 2: Comparing Prompt 1 and Prompt 2 with HCSAEF.
  • Figure 3: Comparing different LLMs with HCSAEF.
  • Figure 4: GPT-4o executing Prompt 1 three times.