A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement

Tianyi Peng, George Gui, Melanie Brucks, Daniel J. Merlau, Grace Jiarui Fan, Malek Ben Sliman, Eric J. Johnson, Abdullah Althenayyan, Silvia Bellezza, Dante Donati, Hortense Fong, Elizabeth Friedman, Ariana Guevara, Mohamed Hussein, Kinshuk Jerath, Bruce Kogut, Akshit Kumar, Kristen Lane, Hannah Li, Vicki Morwitz, Oded Netzer, Patryk Perkowski, Olivier Toubia

TL;DR

The paper reports a large-scale evaluation of digital twins built from rich individual-level data, testing how well they mirror human responses across diverse domains. Across 19 preregistered sub-studies on a representative U.S. panel, twins reach about 75% individual-level accuracy, which is no higher than what demographics-only personas achieve, and only modest correlations (~0.2) with human answers. Correlation improves when detailed personal information is included, to the point of outperforming traditional ML benchmarks that require additional data, yet twins do not improve predictions of population-level means and remain under-dispersed relative to humans. The authors provide open data and code, discuss domain-specific strengths and weaknesses (notably stronger in social and cognitive domains and weaker in political contexts), and argue for cautious, advisory use of digital twins while highlighting avenues to improve their predictive pipelines.

Abstract

Digital representations of individuals ("digital twins") promise to transform social science and decision-making. Yet it remains unclear whether such twins truly mirror the people they emulate. We conducted 19 preregistered studies with a representative U.S. panel and their digital twins, each constructed from rich individual-level data, enabling direct comparisons between human and twin behavior across a wide range of domains and stimuli (including never-seen-before ones). Twins reproduced individual responses with 75% accuracy and seemingly low correlation with human answers (approximately 0.2). However, this apparently high accuracy was no higher than that achieved by generic personas based on demographics only. In contrast, correlation improved when twins incorporated detailed personal information, even outperforming traditional machine learning benchmarks that require additional data. Twins exhibited systematic strengths and weaknesses - performing better in social and personality domains, but worse in political ones - and were more accurate for participants with higher education, higher income, and moderate political views and religious attendance. Together, these findings delineate both the promise and the current limits of digital twins: they capture some relative differences among individuals but not yet the unique judgments of specific people. All data and code are publicly available to support the further development and evaluation of digital twin pipelines.

Paper Structure

This paper contains 145 sections, 2 equations, 45 figures, 1 table.

Figures (45)

  • Figure 1: Mega-Study Overview. We run 19 pre-registered studies on digital twins from the Twin-2K-500 dataset and their human counterparts. The studies were proposed by a diverse group of scholars and cover a wide range of behaviors and domains. As a set, they represent how digital twins may be leveraged today by social scientists. We match the answer of each digital twin to each question with the answer from their human counterpart, allowing us to explore the performance of digital twins both at the individual and population levels.
  • Figure 2: Gains from Leveraging Individual-Level Data. *: best performing benchmark, or not significantly different from best at $p < 0.05$ (not applicable to ratio of standard deviations). Creating digital twins using rich individual-level data improves the correlation between twin and human responses compared to synthetic personas based on demographics only. It also increases the variance in responses, although digital twins remain under-dispersed. Individual-level accuracy and predictions of population means are not improved.
  • Figure 3: Human Responses vs. Digital Twin Responses based on Demographics Only (Left) and Human Responses vs. Digital Twin Responses based on Full Persona (Right), for One Particular Outcome. 45° line included. This example illustrates how correlation may be much improved when full personas are used, without significant change to individual-level accuracy.
  • Figure 4: Comparison with Traditional Machine Learning Method. We compare digital twins (blue line) to an XGBoost model (orange line) trained for each outcome using additional data not needed or used by digital twins: responses for that outcome from a training subset of participants. For such a traditional machine learning approach to match the average out-of-sample predictive correlation achieved by digital twins, one would need to collect each outcome from approximately 180 participants. Even when the training sample reaches 650 participants, the predictive correlation achieved by XGBoost does not exceed 0.29. An XGBoost model trained only on demographic variables (green line) never reaches the performance of digital twins, even when using up to 650 participants as the training sample.
  • Figure 5: Results from Meta-Analysis (Mixed Linear Model with z-transformed Correlation as Dependent Variable). The average correlation across all outcomes is 0.197. Correlation between responses from twins vs. their human counterparts tends to be higher in social domains, except when social desirability is salient. Correlation tends to be higher in the cognitive domain and in domains related to human-technology interactions, but lower in the political domain and when providing valenced evaluations.
  • ...and 40 more figures
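
The captions for Figures 2 and 5 refer to three quantities: the correlation between twin and human responses, the ratio of standard deviations (used to show under-dispersion), and correlations that are z-transformed before being combined. The following is a minimal sketch of how such quantities could be computed for a single outcome, not the authors' pipeline. The function names and the toy response vectors are hypothetical, and the plain average of z-values is a simplification standing in for the mixed linear model named in Figure 5.

```python
import math
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length response vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def sd_ratio(twin, human):
    """Ratio of sample standard deviations; a value below 1 means the
    twin responses are less dispersed than the human responses."""
    return stdev(twin) / stdev(human)


def fisher_z(r):
    """Fisher z-transform of a correlation coefficient."""
    return math.atanh(r)


def pooled_correlation(correlations):
    """Average correlations on the z scale, then transform back.
    Assumption: an unweighted mean, not the paper's mixed linear model."""
    return math.tanh(mean(fisher_z(r) for r in correlations))


# Hypothetical toy data: five participants answering one 5-point item.
human = [1, 2, 3, 4, 5]
twin = [3, 3, 3, 4, 4]

r = pearson(human, twin)
ratio = sd_ratio(twin, human)
pooled = pooled_correlation([0.1, 0.3])
```

In this toy example the twin answers track the ordering of the human answers while clustering near the middle of the scale, so the correlation is positive but the standard deviation ratio falls below 1. That is the same qualitative pattern the Figure 2 caption describes as under-dispersion.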