Learned, Lagged, LLM-splained: LLM Responses to End User Security Questions

Vijay Prakash, Kevin Lee, Arkaprabha Bhattacharya, Danny Yuxing Huang, Jessica Staddon

TL;DR

The paper interrogates how large language models respond to end-user security questions, highlighting risks such as outdated guidance and indirect communication. Through a qualitative, expert-based evaluation of 900 open-ended queries answered by GPT, Llama, and Gemini, the authors identify a taxonomy of content and communication errors, including misinterpretation, hallucinations, and over-reliance on marketing messaging. They contribute a corpus of questions with authoritative sources, a qualitative evaluation framework, and an expert-labeled response dataset to guide both users and developers. The findings emphasize the need for better alignment with current security research, transparent sourcing, and user-focused prompts, offering practical guidance and open problems to improve the reliability and usefulness of AI-assisted security advice.

Abstract

Answering end user security questions is challenging. While large language models (LLMs) like GPT, Llama, and Gemini are far from error-free, they have shown promise in answering a variety of questions outside of security. We studied LLM performance in the area of end user security by qualitatively evaluating three popular LLMs on 900 systematically collected end user security questions. While LLMs demonstrate broad generalist "knowledge" of end user security information, there are patterns of errors and limitations across LLMs consisting of stale and inaccurate answers, and indirect or unresponsive communication styles, all of which impact the quality of information received. Based on these patterns, we suggest directions for model improvement and recommend user strategies for interacting with LLMs when seeking assistance with security.

Paper Structure

This paper contains 67 sections and 7 tables.