TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, Yiqun Liu

TL;DR

TwinVoice presents a comprehensive, multi-dimensional benchmark to evaluate LLM-based persona simulation for digital twins. It integrates three real-world dimensions (Social, Interpersonal, and Narrative) with a fine-grained six-capability framework and paired discriminative and generative evaluations, the latter assessed by an LLM-as-a-Judge. Empirical results show strong surface-level lexical and opinion alignment but persistent gaps in memory recall and stylistic tone, and human baselines remain higher. The work provides rigorous evaluation protocols, open-source resources, and a roadmap for advancing personalized AI systems and robust digital-twin capabilities.

Abstract

Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack a systematic framework, and do not analyze which capabilities the task requires. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities: opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short on capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.

Paper Structure

This paper contains 90 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The conceptual framework of TwinVoice. (Left) The evaluation is structured across three core dimensions that represent distinct aspects of identity expression: Social Persona (public-facing), Interpersonal Persona (private interaction), and Narrative Persona (fictional scenarios). The LLMs are prompted with a person's historical context to simulate their behavior. The LLM's ability for persona simulation is categorized into six fundamental capabilities. (Right) Experimental results averaged over the three dimensions.
  • Figure 2: Overview of the TwinVoice evaluation. (Top) The LLMs are prompted with a specific persona's history and given a stimulus to respond to. (Bottom) Three evaluation protocols. Discriminative: the model chooses among options A–D, one of which is the ground-truth persona behavior. Generative-Ranking: the model writes a response and an LLM-as-a-Judge selects the best candidate, yielding Acc.(Gen). Generative-Scoring: the model writes a response and the Judge rates its similarity on opinion, logic, and style, yielding Score(Gen).
  • Figure 3: Performance across six capabilities. Each panel shows one capability. For each model, bars give scores on the three dimensions—Social, Interpersonal, and Narrative. Purple diamonds indicate the mean across the three dimensions for that model. The y-axis is the average over the three evaluation protocols: discriminative, generative ranking, and generative scoring. The gray dashed line denotes chance level (25%).
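
The discriminative protocol and the per-model mean described in the captions for Figures 2 and 3 can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy predictions, and the toy per-dimension scores are hypothetical, and the unweighted mean across dimensions is an assumption about how the summary value is computed.

```python
from statistics import mean

OPTIONS = ("A", "B", "C", "D")
# With four options, random guessing is right 1/4 of the time,
# which matches the 25% chance level mentioned for Figure 3.
CHANCE_LEVEL = 1 / len(OPTIONS)


def discriminative_accuracy(predictions, gold):
    """Fraction of items where the chosen option equals the ground-truth option."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def mean_over_dimensions(scores_by_dimension):
    """Unweighted mean across dimensions (an assumed aggregation)."""
    return mean(list(scores_by_dimension.values()))


# Hypothetical toy data, not results from the paper.
gold = ["A", "C", "B", "D"]
preds = ["A", "C", "D", "D"]
acc = discriminative_accuracy(preds, gold)  # 3 of 4 items match

toy_scores = {"Social": 0.5, "Interpersonal": 0.75, "Narrative": 0.25}
avg = mean_over_dimensions(toy_scores)

print(f"accuracy={acc:.2f}, chance={CHANCE_LEVEL:.2f}, mean over dimensions={avg:.2f}")
```

Comparing an accuracy against `CHANCE_LEVEL` is only meaningful for the four-option discriminative setting. The generative scores produced by the Judge would need their own normalization, which is not specified in the captions.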