Text-Based Personas for Simulating User Privacy Decisions

Kassem Fawaz, Ren Yi, Octavian Suciu, Rishabh Khandelwal, Hamza Harkous, Nina Taft, Marco Gruteser

Abstract

The ability to simulate human privacy decisions has significant implications for aligning autonomous agents with individual intent and conducting cost-effective, large-scale privacy-centric user studies. Prior approaches prompt Large Language Models (LLMs) with natural language user statements, data-sharing histories, or demographic attributes to simulate privacy decisions. These approaches, however, fail to balance individual-level accuracy, prompt usability, token efficiency, and population-level representation. We present Narriva, an approach that generates text-based synthetic privacy personas to address these shortcomings. Narriva grounds persona generation in prior user privacy decisions, such as those from large-scale survey datasets, rather than purely relying on demographic stereotypes. It compresses this data into concise, human-readable summaries structured by established privacy theories. Through benchmarking across five diverse datasets, we analyze the characteristics of Narriva's synthetic personas in modeling both individual and population-level privacy preferences. We find that grounding personas in past privacy behaviors achieves up to 88% predictive accuracy (significantly outperforming a non-personalized LLM baseline), and yields an 80-95% reduction in prompt tokens compared to in-context learning with raw examples. Finally, we demonstrate that personas synthesized from a single survey can reproduce the aggregate privacy behaviors and statistical distributions (TVComplement up to 0.85) of entirely different studies.
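The abstract reports population-level agreement as a TVComplement score without defining it here. A minimal sketch follows, assuming the metric has its common meaning of one minus the total variation distance between two categorical answer distributions; the function name `tv_complement` and the toy "allow"/"deny" responses are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 - total variation distance between two categorical samples.

    Assumed definition: TVD = 0.5 * sum_c |p_real(c) - p_synth(c)|.
    A score of 1.0 means identical answer distributions, 0.0 means disjoint.
    """
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    # Compare over every category seen in either sample.
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


# Hypothetical answers to one survey question (not data from the paper).
survey = ["allow", "allow", "deny", "deny"]
simulated = ["allow", "allow", "allow", "deny"]
score = tv_complement(survey, simulated)
print(score)  # 0.75
```

Under this assumed definition, the reported value of 0.85 would mean that the simulated and observed answer distributions differ by a total variation distance of 0.15.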

Paper Structure (63 sections, 10 equations, 15 figures, 11 tables, 1 algorithm)

Figures (15)

  • Figure 1: High-level overview of Narriva, which comprises three stages: initial persona generation, lightweight optimization, and prediction using the optimized persona.
  • Figure 2: Individual- and population-level performance of text-based personas using an 80:20 generation/evaluation split. Error bars show 95% confidence intervals obtained via bootstrapping. We compare the summarized persona derived from question-answer pairs (Narrative) against the original question-answer pairs (Raw) and the LLM's internal privacy baseline without personalization (Baseline).
  • Figure 3: The fraction of individuals across the five datasets for which each specific privacy theory template yields the highest predictive accuracy.
  • Figure 4: Comparison of individual accuracy and population-level $TVD$ when using the Basic text-based persona versus dynamically selecting the Best-performing theoretical template per respondent. Error bars are obtained via bootstrapping.
  • Figure 5: Performance when generating from behavioral questions and evaluating on behavioral questions, across prompt templates. Error bars are obtained via bootstrapping.
  • ...and 10 more figures