How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?

Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das

TL;DR

An evaluation of twelve LLMs, four persona construction strategies, and two prompting methods finds substantial room for improvement: all models score between 50 and 64 on average on a 0-100 scale, and newer, bigger, and smarter models do not reliably do better and sometimes do worse.

Abstract

A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats. If correct, these simulations could offer a scalable way to forecast S&P risks in products prior to deployment. We interrogate this assumption using SP-ABCBench, a new benchmark of 30 tests derived from validated S&P human-subject studies. The benchmark measures alignment between simulations and human-subject studies on a 0-100 scale, where higher scores indicate better alignment, across three dimensions: Attitude, Behavior, and Coherence. Evaluating twelve LLMs, four persona construction strategies, and two prompting methods, we found that there remains substantial room for improvement: all models score between 50 and 64 on average. Newer, bigger, and smarter models do not reliably do better and sometimes do worse. Some simulation configurations, however, do yield high alignment: for example, scores exceed 95 on some behavior tests when agents are prompted to apply bounded rationality and weigh privacy costs against perceived benefits. We release SP-ABCBench to enable reproducible evaluation as methods improve.

Paper Structure (86 sections, 3 equations, 15 figures, 9 tables, 2 algorithms)

Figures (15)

  • Figure 1: Experimental pipeline for population-level S&P simulation. Left: We generate synthetic participants by sampling U.S. Census-based demographics and constructing personas using four strategies of increasing specificity (RQ2). Center: Persona-endowed LLM agents (RQ1) complete S&P survey instruments or decision tasks, optionally guided by theory-informed prompting (RQ3). Right: Aggregated agent responses are compared against published human-subject results to evaluate alignment across Attitude, Behavior, and Coherence.
  • Figure 2: Distribution of simulation quality scores. (A) Overall distribution across all tests using demographic attributes only (N = 360, mean = 58.92, SD = 27.38). (B) Density plots by test type show distinct patterns: Attitude tests cluster tightly around 65-75, Behavior tests show lower and more variable scores, and Coherence tests exhibit a bimodal distribution.
  • Figure 3: Models ranked by average simulation quality across all tests. Error bars show standard error. The 13-point spread from GPT-4.1 (50.36) to Gemini-2.5-Flash-Lite (63.63) indicates model choice affects alignment, but no model achieves uniformly high fidelity.
  • Figure 4: Top-performing models across Attitude, Behavior, and Coherence tests. Rankings differ by dimension: GPT-4.1-Nano leads Attitude, Gemini-3.0-Flash leads Behavior, and Gemini-2.5-Flash-Lite leads Coherence. No single model dominates across all three dimensions.
  • Figure 5: Simulation quality trends across model characteristics. Scale does not monotonically improve simulation quality; medium reasoning performs comparably to minimal reasoning; newer generations yield small average gains; open-source models trail proprietary models only slightly.
  • ...and 10 more figures
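
The summary statistics quoted in the captions for Figures 2 and 3 can be sanity-checked with a few lines of Python. The sketch below is illustrative only. It assumes that the error bars in Figure 3 are the usual standard error of the mean (sample standard deviation divided by the square root of the number of tests), which the captions do not confirm. The function name `mean_and_standard_error` and the `toy_scores` list are hypothetical and are not taken from SP-ABCBench.

```python
import math
from statistics import mean, stdev


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for per-test scores.

    Assumption: standard error = sample SD / sqrt(n). The source does not
    state which estimator the authors used for their error bars.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    return mean(scores), stdev(scores) / math.sqrt(n)


# Endpoints of the model ranking, as reported in the Figure 3 caption.
reported_means = {"GPT-4.1": 50.36, "Gemini-2.5-Flash-Lite": 63.63}
spread = max(reported_means.values()) - min(reported_means.values())
print(f"spread between best and worst model: {spread:.2f} points")

# N = 360 in Figure 2 is consistent with 12 models x 30 tests
# (an inference from the abstract, not a statement in the caption).
n_models, n_tests = 12, 30
n_scores = n_models * n_tests
print(f"model-test pairs: {n_scores}")

# Placeholder scores for demonstration only; NOT results from the paper.
toy_scores = [40.0, 60.0, 80.0]
toy_mean, toy_se = mean_and_standard_error(toy_scores)
print(f"toy mean = {toy_mean:.2f}, toy standard error = {toy_se:.2f}")
```

The computed spread rounds to the "13-point spread" stated in the caption, and the product of models and tests matches the N reported for Figure 2 under the stated reading.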