Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data

Minghao Guo, Ziyi Ye, Wujiang Xu, Xi Zhu, Wenyue Hua, Dimitris N. Metaxas

TL;DR

This case study investigates LLM-based individual simulation using a volunteer-contributed archive of private messaging history spanning over ten years, and proposes the Individual Turing Test, which evaluates whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer.

Abstract

Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study that investigates LLM-based individual simulation using a volunteer-contributed archive of private messaging history spanning over ten years. Based on the messaging data, we propose the "Individual Turing Test" to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer. We investigate prevalent LLM-based individual simulation approaches, including fine-tuning, retrieval-augmented generation (RAG), memory-based approaches, and hybrid methods that integrate fine-tuning with RAG or memory. Empirical results show that current LLM-based simulation methods do not pass the Individual Turing Test, but they perform substantially better when the same test is conducted with strangers to the target individual. Additionally, while fine-tuning improves simulation in daily chats, which reflect the individual's language style, retrieval-augmented and memory-based approaches demonstrate stronger performance on questions involving personal opinions and preferences. These findings reveal a fundamental trade-off between parametric and non-parametric approaches to individual simulation with LLMs when given a longitudinal context.

Paper Structure (19 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Comparison between General and Individual Turing Tests. We report selection rates for each method under two evaluation settings: (a) General Turing Test, where responses are judged by strangers, and (b) Individual Turing Test, where responses are judged by acquaintances of the target individual. Results are stacked by prompt type, distinguishing daily conversations and personal opinion prompts.
  • Figure 2: Effect of expanding recent memory on hybrid simulation (A-Mem + LoRA). $Y_i$ denotes using the most recent $i$ years of dialogue history (e.g., $Y_1$ = most recent year only; $Y_{10}$ = most recent ten years).
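The two quantities named in the figure captions can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it (`selection_rates`, `recent_years`, the tuple layouts, the source labels) is hypothetical. It assumes that a selection rate is the fraction of judge decisions, within one prompt type, that pick a given source's response from the candidate pool, and that $Y_i$ keeps only messages dated within the most recent $i$ years, approximated here as $365.25 \cdot i$ days.

```python
from collections import Counter
from datetime import datetime, timedelta


def selection_rates(picks):
    """Compute per-prompt-type selection rates.

    picks: list of (prompt_type, chosen_source) tuples, one per judge decision.
    Assumption: rate = share of decisions within a prompt type that chose
    a given source (e.g. the volunteer or one simulation method).
    """
    counts = {}
    for prompt_type, source in picks:
        counts.setdefault(prompt_type, Counter())[source] += 1
    return {
        ptype: {src: n / sum(c.values()) for src, n in c.items()}
        for ptype, c in counts.items()
    }


def recent_years(messages, i, now):
    """Return the Y_i slice: messages from the most recent i years.

    messages: list of (timestamp, text) tuples with datetime timestamps.
    Assumption: a year is approximated as 365.25 days.
    """
    cutoff = now - timedelta(days=365.25 * i)
    return [m for m in messages if m[0] > cutoff]
```

For example, passing one tuple per judge decision to `selection_rates` yields a nested dictionary keyed first by prompt type and then by source, and `recent_years(messages, 1, now)` yields the $Y_1$ slice of a message list.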