Human-Centred LLM Privacy Audits: Findings and Frictions

Dimitri Staufer, Kirsten Morehouse, David Hartmann, Bettina Berendt

Abstract

Large language models (LLMs) learn statistical associations from massive training corpora and user interactions, and deployed systems can surface or infer information about individuals. Yet people lack practical ways to inspect what a model associates with their name. We report interim findings from an ongoing study and introduce LMP2, a browser-based self-audit tool. In two user studies ($N_{\text{total}} = 458$), GPT-4o predicts 11 of 50 features for everyday people with $\geq 60\%$ accuracy, and participants report wanting control over LLM-generated associations even though they do not consider all outputs privacy violations. To validate our probing method, we evaluate eight LLMs on public figures and non-existent names, observing clear separation between stable name-conditioned associations and model defaults. Our findings also help expose a broader generative AI evaluation crisis: when outputs are probabilistic, context-dependent, and user-mediated through elicitation, what model--individual associations even include is under-specified, and operationalisation relies on crafting probes and metrics that are hard to validate or compare. To move towards reliable, actionable human-centred LLM privacy audits, we identify nine frictions that emerged in our study and offer recommendations for future work and audit design.

Paper Structure (17 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Walk-through of the LMP2 interface for privacy self-audits of LLMs: Participants use the tool in four stages: (1) enter their full name and agree to terms, (2) select human features from a categorised list, (3) view Results Cards with model predictions and confidence scores, and (4) provide feedback on correctness, privacy concerns, and emotional reactions.
  • Figure 2: LMP2 probing pipeline for black-box APIs: Ground-truth values are truncated, combined with random counterfactual prefixes and paraphrased canaries, then "restored" by the model. Outputs are calibrated against a generic-subject baseline and ranked by frequency and NLL to produce top predictions, association strength, and confidence.
  • Figure 3: System overview of LMP2: Users enter their full name and selected features, the backend generates prefixes and counterfactuals, queries the LLM, and aggregates results into top predictions, association strength, and confidence.
  • Figure 4: Distribution of confidence across models and subject sets: Confidence separates famous from synthetic individuals across models, indicating stable name-conditioned associations for users with high web presence.
  • Figure 5: Empirical evaluation (Famous dataset). Precision vs. recall across models. Larger API-based models show stable coupling between precision and recall, while smaller models exhibit recall collapses despite moderate precision.
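
The ranking and calibration step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the LMP2 implementation: the caption does not give the formulas, so the names `Sample`, `rank_candidates`, and `calibrated_strength` are hypothetical, the tie-breaking rule (frequency first, then lower mean NLL) is an assumption, and calibration is assumed to be a simple frequency difference against the generic-subject baseline.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    """One model completion for a probe (hypothetical structure)."""
    value: str   # candidate feature value the model "restored"
    nll: float   # negative log-likelihood of that completion (lower = more likely)


def rank_candidates(samples):
    """Rank values by frequency, breaking ties by lower mean NLL (assumed rule).

    Returns a list of (value, relative_frequency, mean_nll) tuples.
    """
    counts = Counter(s.value for s in samples)
    mean_nll = {
        v: sum(s.nll for s in samples if s.value == v) / c
        for v, c in counts.items()
    }
    total = len(samples)
    ranked = sorted(counts, key=lambda v: (-counts[v], mean_nll[v]))
    return [(v, counts[v] / total, mean_nll[v]) for v in ranked]


def calibrated_strength(named, generic, value):
    """Frequency of `value` for the real name minus its frequency for a
    generic subject (assumed form of baseline calibration)."""
    def freq(samples):
        if not samples:
            return 0.0
        return sum(s.value == value for s in samples) / len(samples)
    return freq(named) - freq(generic)


# Toy data, invented purely for illustration.
named = [Sample("Berlin", 0.4), Sample("Berlin", 0.6),
         Sample("Paris", 1.2), Sample("Berlin", 0.5)]
generic = [Sample("Paris", 0.9), Sample("London", 1.0),
           Sample("Berlin", 1.5), Sample("Paris", 0.8)]

top = rank_candidates(named)
strength = calibrated_strength(named, generic, top[0][0])
```

In this toy run the top-ranked value is the one the model returns most often for the named subject, and `strength` is positive only when that value appears more often for the name than for the generic baseline, which is the intuition behind separating name-conditioned associations from model defaults.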