A Synthetic Dataset for Personal Attribute Inference

Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev

TL;DR

This work constructs a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles, and generates SynthPAI, a diverse synthetic dataset of over 7800 comments manually labeled for personal attributes.

Abstract

Recently, powerful Large Language Models (LLMs) have become easily accessible to hundreds of millions of users world-wide. However, their strong capabilities and vast world knowledge do not come without associated privacy risks. In this work, we focus on the emerging privacy threat LLMs pose -- the ability to accurately infer personal information from online texts. Despite the growing importance of LLM-based author profiling, research in this area has been hampered by a lack of suitable public datasets, largely due to ethical and privacy concerns associated with real personal data. We take two steps to address this problem: (i) we construct a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles; (ii) using this framework, we generate SynthPAI, a diverse synthetic dataset of over 7800 comments manually labeled for personal attributes. We validate our dataset with a human study showing that humans barely outperform random guessing on the task of distinguishing our synthetic comments from real ones. Further, we verify that our dataset enables meaningful personal attribute inference research by showing across 18 state-of-the-art LLMs that our synthetic comments allow us to draw the same conclusions as real-world data. Combined, our experimental results, dataset and pipeline form a strong basis for future privacy-preserving research geared towards understanding and mitigating inference-based privacy threats that LLMs pose.

Paper Structure (65 sections, 1 equation, 17 figures, 5 tables, 1 algorithm)


Figures (17)

  • Figure 1: Overview of our personalized LLM agent-based thread simulation framework and the curation of SynthPAI. First, in step (1), we create diverse synthetic profiles and seed LLM agents with them. Then, in step (2), we let the agents interact to generate comment threads. Finally, in step (3), aided by an LLM, we label the generated comments for inferable personal attributes.
  • Figure 2: Profile-conditioned comment generation. Agents generate comments based on the provided context, their synthetic profile and their writing style.
  • Figure 3: Similarity of individual profiles found in SynthPAI as measured by the exact overlap of their respective personal attribute values.
  • Figure 4: Personal attribute inference (PAI) accuracy of 18 frontier LLMs on SynthPAI. In line with [beyond_mem], GPT-4 [OpenAI2023GPT4TR] is the best-performing PAI model. The same scaling laws relating model capabilities to PAI performance can also be observed, as previously reported by [beyond_mem].
  • Figure 5: GPT-4 accuracy [%] on personal attribute inference across SynthPAI after anonymization.
  • ...and 12 more figures
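
The profile similarity described in the Figure 3 caption (exact overlap of personal attribute values) can be illustrated with a minimal sketch. This is not the authors' code: the attribute names, the profile values, and the helper functions `exact_overlap` and `pairwise_overlap` are all hypothetical, and the sketch assumes that overlap simply counts attributes whose values match exactly between two profiles.

```python
from itertools import combinations

# Hypothetical attribute names and values, for illustration only;
# they are not taken from the SynthPAI data.
profiles = {
    "p1": {"age": 31, "occupation": "nurse", "city": "Zurich"},
    "p2": {"age": 31, "occupation": "teacher", "city": "Zurich"},
    "p3": {"age": 54, "occupation": "nurse", "city": "Oslo"},
}


def exact_overlap(a: dict, b: dict) -> int:
    """Count attributes present in both profiles with identical values."""
    return sum(1 for key in a.keys() & b.keys() if a[key] == b[key])


def pairwise_overlap(profiles: dict) -> dict:
    """Map each unordered pair of profile ids to its exact-overlap count."""
    return {
        (i, j): exact_overlap(profiles[i], profiles[j])
        for i, j in combinations(sorted(profiles), 2)
    }


overlaps = pairwise_overlap(profiles)
```

Under this assumed definition, low pairwise counts across a dataset would indicate that profiles rarely share exact attribute values, which is one way to read a diversity plot of this kind.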