Does Johnny Get the Message? Evaluating Cybersecurity Notifications for Everyday Users
Victor Jüttner, Erik Buchmann
TL;DR
This work addresses the challenge of making cybersecurity alerts from everyday devices understandable to non-experts by introducing HCSAEF, a seven-dimension evaluation framework for LLM-generated notifications. The framework uses a 5-point Likert scale to assess Consequences, Context, Countermeasures, Correctness, Intuitiveness, Personalization, and Urgency, enabling systematic comparisons across prompts and LLMs. Through case studies on IDS alerts, the authors demonstrate how prompt design, model choice, and output robustness influence notification quality, providing actionable guidance for researchers and practitioners designing usable security communications. They also outline a path toward automation (LLM-as-a-judge) and large-scale evaluation, highlighting the framework's potential to improve security awareness and response in smart-home contexts.
Abstract
Due to the increasing presence of networked devices in everyday life, not only cybersecurity specialists but also end users benefit from security applications such as firewalls, vulnerability scanners, and intrusion detection systems. Recent approaches use large language models (LLMs) to rewrite brief, technical security alerts into intuitive language and suggest actionable measures, helping everyday users understand and respond appropriately to security risks. However, it remains an open question how well such alerts are explained to users. LLM outputs can also be hallucinated, inconsistent, or misleading. In this work, we introduce the Human-Centered Security Alert Evaluation Framework (HCSAEF). HCSAEF assesses LLM-generated cybersecurity notifications to support researchers who want to compare notifications generated for everyday users, improve them, or analyze the capabilities of different LLMs in explaining cybersecurity issues. We demonstrate HCSAEF through three use cases, which allow us to quantify the impact of prompt design, model selection, and output consistency. Our findings indicate that HCSAEF effectively differentiates generated notifications along dimensions such as intuitiveness, urgency, and correctness.
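To make the rating scheme concrete, the following is a minimal sketch of how HCSAEF-style ratings might be recorded and compared between two conditions, for example two prompts or two LLMs. It assumes one integer score from 1 to 5 per dimension, using the seven dimensions named in the TL;DR. The function names (`validate_rating`, `mean_per_dimension`, `compare`) are hypothetical, and averaging Likert scores per dimension is a simplifying assumption, not necessarily the aggregation the authors use.

```python
from statistics import mean

# The seven HCSAEF dimensions as named in the TL;DR.
DIMENSIONS = (
    "Consequences", "Context", "Countermeasures", "Correctness",
    "Intuitiveness", "Personalization", "Urgency",
)


def validate_rating(rating):
    """Check that a rating has exactly the seven dimensions, each scored 1..5."""
    missing = set(DIMENSIONS) - set(rating)
    unexpected = set(rating) - set(DIMENSIONS)
    if missing or unexpected:
        raise ValueError(
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )
    for dim, score in rating.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 1 <= score <= 5:
            raise ValueError(f"{dim}: expected an integer in 1..5, got {score!r}")
    return rating


def mean_per_dimension(ratings):
    """Average each dimension over a list of ratings (one per notification).

    Assumption: treating the ordinal Likert scores as numbers and averaging
    them is a simplification chosen for this sketch.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        validate_rating(rating)
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


def compare(ratings_a, ratings_b):
    """Per-dimension difference in mean score, condition A minus condition B."""
    mean_a = mean_per_dimension(ratings_a)
    mean_b = mean_per_dimension(ratings_b)
    return {dim: mean_a[dim] - mean_b[dim] for dim in DIMENSIONS}


if __name__ == "__main__":
    # Illustrative values only; these are not results from the paper.
    neutral = {dim: 3 for dim in DIMENSIONS}
    clearer = dict(neutral, Intuitiveness=5, Countermeasures=4)
    for dim, delta in compare([clearer], [neutral]).items():
        print(f"{dim:16s} {delta:+}")
```

A positive difference for a dimension means condition A scored higher on average than condition B on that dimension. The same structure could hold scores assigned by human raters or by an automated judge.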
