PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

Minjia Wang, Yunfeng Wang, Xiao Ma, Dexin Lv, Qifan Guo, Lynn Zheng, Benliang Wang, Lei Wang, Jiannan Li, Yongwei Xing, David Xu, Zheng Sun

Abstract

Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, and reminders. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.

Paper Structure (37 sections, 12 figures, 5 tables)

Figures (12)

  • Figure 1: An overview of our methods. The Persona Agent generates a basic profile from a demographic prior and iteratively adds realistic attributes to it. The Event Agent retrieves seed events from the event memory, aligns them to the persona, and brainstorms with self-reflection to generate an event forest that serves as the scaffolding of the digital footprints. The Artifact Generator Agent and the Critic Agents form a loop: the Artifact Generator Agent generates an outline and then the digital artifacts, and the Critic Agents provide feedback to iteratively improve the quality of the artifacts.
  • Figure 2: UMAP visualization of the dataset embeddings related to text messages and conversations. Synthetic datasets are denoted by circles, while real datasets are represented by triangles. Among the synthetic datasets, PersonaTrace appears closest in the embedding space to the private dataset and u-sticker, indicating a higher degree of realism and alignment with real-world digital communications.
  • Figure 3: Comparison of diversity and realism between the agent-ablated and full implementations. For Pairwise Correlation, lower values indicate greater diversity. For Remote-Clique and Entropy, higher values reflect greater diversity. Average number of links per email and average email length are used as indicators of realism, with higher values suggesting closer resemblance to human-generated emails.
  • Figure 4: Prompts for LLM-as-judge evaluation.
  • Figure 5: Prompt for generating a comprehensive and culturally grounded profile.
  • ...and 7 more figures