A Synthetic Dataset for Personal Attribute Inference

Hanna Yukhymenko; Robin Staab; Mark Vero; Martin Vechev

TL;DR

This work constructs a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles, and generates SynthPAI, a diverse synthetic dataset of over 7800 comments manually labeled for personal attributes.

Abstract

Recently, powerful Large Language Models (LLMs) have become easily accessible to hundreds of millions of users worldwide. However, their strong capabilities and vast world knowledge do not come without associated privacy risks. In this work, we focus on an emerging privacy threat that LLMs pose: the ability to accurately infer personal information from online texts. Despite the growing importance of LLM-based author profiling, research in this area has been hampered by a lack of suitable public datasets, largely due to ethical and privacy concerns associated with real personal data. We take two steps to address this problem: (i) we construct a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles; (ii) using this framework, we generate SynthPAI, a diverse synthetic dataset of over 7800 comments manually labeled for personal attributes. We validate our dataset with a human study showing that humans barely outperform random guessing on the task of distinguishing our synthetic comments from real ones. Further, we verify that our dataset enables meaningful personal attribute inference research by showing across 18 state-of-the-art LLMs that our synthetic comments allow us to draw the same conclusions as real-world data. Combined, our experimental results, dataset, and pipeline form a strong basis for future privacy-preserving research geared towards understanding and mitigating the inference-based privacy threats that LLMs pose.

Paper Structure (65 sections, 1 equation, 17 figures, 5 tables, 1 algorithm)

Introduction
The PAI Data Gap
This Work
Diversity and Fidelity
Enabling PAI Research
Main Contributions
Background and Related Work
Personal data and Personally Identifiable Information (PII)
Privacy Risks of LLMs
Author Profiling and PAI
Existing Datasets
Synthetic Data Generation with LLMs
Building a Reddit Simulation Environment and Agents
Key Requirements
Simulating Reddit via Personalized LLM Agents
...and 50 more sections

Figures (17)

Figure 1: Overview of our personalized LLM agent-based thread simulation framework and the curation of SynthPAI. First, in step 1, we create diverse synthetic profiles and seed LLM agents with them. Then, in step 2, we let the agents interact to generate comment threads. Finally, in step 3, aided by an LLM, we label the generated comments for inferable personal attributes.
Figure 2: Profile-conditioned comment generation. Agents generate comments based on the provided context, their synthetic profile and their writing style.
Figure 3: Similarity of individual profiles found in SynthPAI as measured by the exact overlap of their respective personal attribute values.
Figure 4: Personal attribute inference accuracy of 18 frontier LLMs on SynthPAI. In line with [beyond_mem], GPT-4 [OpenAI2023GPT4TR] is the best-performing PAI model. The same scaling laws relating model capabilities to PAI performance reported by [beyond_mem] can also be observed.
Figure 5: GPT-4 accuracy [%] on personal attribute inference across SynthPAI after anonymization.
...and 12 more figures
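
The profile similarity described in the Figure 3 caption (exact overlap of personal attribute values) can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute names, the example profiles, and the choice to normalize by the number of shared attributes are all assumptions made for illustration, and the paper may define the measure differently.

```python
from itertools import combinations


def exact_overlap(profile_a: dict, profile_b: dict) -> float:
    """Fraction of shared attributes whose values match exactly.

    Assumption: overlap is normalized by the number of attributes
    present in both profiles; the paper may count matches differently.
    """
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return 0.0
    matches = sum(profile_a[key] == profile_b[key] for key in shared)
    return matches / len(shared)


def pairwise_overlaps(profiles: list[dict]) -> list[float]:
    """Exact overlap for every unordered pair of profiles."""
    return [exact_overlap(a, b) for a, b in combinations(profiles, 2)]


# Hypothetical profiles; the attribute names are illustrative only.
profiles = [
    {"age": 31, "sex": "female", "occupation": "nurse", "education": "bachelor"},
    {"age": 31, "sex": "male", "occupation": "nurse", "education": "master"},
    {"age": 54, "sex": "male", "occupation": "teacher", "education": "master"},
]

overlaps = pairwise_overlaps(profiles)
print(overlaps)  # [0.5, 0.0, 0.5]
```

Under this reading, a dataset with diverse profiles would show most pairwise values close to 0, and values near 1 would flag near-duplicate profiles.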
