A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement
Tianyi Peng, George Gui, Melanie Brucks, Daniel J. Merlau, Grace Jiarui Fan, Malek Ben Sliman, Eric J. Johnson, Abdullah Althenayyan, Silvia Bellezza, Dante Donati, Hortense Fong, Elizabeth Friedman, Ariana Guevara, Mohamed Hussein, Kinshuk Jerath, Bruce Kogut, Akshit Kumar, Kristen Lane, Hannah Li, Vicki Morwitz, Oded Netzer, Patryk Perkowski, Olivier Toubia
TL;DR
The paper conducts a large-scale evaluation of digital twins built from rich, individual-level data, testing how well they mirror human responses across diverse domains. Using 19 preregistered sub-studies on a representative U.S. panel, it finds that twins achieve about 75% individual-level accuracy, which is no higher than that of generic demographics-only personas, and only modest correlations (~0.2) with human answers. Correlations improve when detailed personal information is included, to the point of outperforming some traditional machine learning benchmarks that require additional data. However, twins do not improve estimates of population-level means, and their responses remain under-dispersed relative to humans. The authors provide open data and code, discuss domain-specific strengths and weaknesses (notably stronger in social and personality domains and weaker in political ones), and argue for cautious, advisory use of digital twins while highlighting avenues to improve their predictive pipelines.
Abstract
Digital representations of individuals ("digital twins") promise to transform social science and decision-making. Yet it remains unclear whether such twins truly mirror the people they emulate. We conducted 19 preregistered studies with a representative U.S. panel and their digital twins, each constructed from rich individual-level data, enabling direct comparisons between human and twin behavior across a wide range of domains and stimuli (including never-seen-before ones). Twins reproduced individual responses with 75% accuracy but showed a seemingly low correlation with human answers (approximately 0.2). Moreover, this apparently high accuracy was no higher than that achieved by generic personas based on demographics only. In contrast, correlation improved when twins incorporated detailed personal information, even outperforming traditional machine learning benchmarks that require additional data. Twins exhibited systematic strengths and weaknesses, performing better in social and personality domains but worse in political ones, and were more accurate for participants with higher education, higher income, and moderate political views and religious attendance. Together, these findings delineate both the promise and the current limits of digital twins: they capture some relative differences among individuals but not yet the unique judgments of specific people. All data and code are publicly available to support the further development and evaluation of digital twin pipelines.
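
To illustrate how high accuracy can coexist with low correlation, the following minimal sketch compares the two metrics on a small, entirely hypothetical set of responses to a 1-to-5 scale item. It assumes accuracy is measured as one minus the absolute human-twin difference divided by the scale range; this definition, the data, and the function names are illustrative assumptions, not taken from the paper's code.

```python
from statistics import mean, pstdev

def range_accuracy(human, twin, lo, hi):
    """Assumed accuracy metric: mean of 1 - |human - twin| / (hi - lo)."""
    return mean(1 - abs(h - t) / (hi - lo) for h, t in zip(human, twin))

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical answers from ten participants on a 1-to-5 scale.
human = [1, 2, 3, 4, 5, 3, 3, 2, 4, 3]
# Hypothetical twin answers that cluster near the middle of the scale.
twin = [3, 3, 3, 3, 3, 3, 3, 3, 4, 4]

acc = range_accuracy(human, twin, lo=1, hi=5)
r = pearson(human, twin)

print(f"accuracy = {acc:.2f}")
print(f"correlation = {r:.2f}")
print(f"human spread = {pstdev(human):.2f}, twin spread = {pstdev(twin):.2f}")
```

In this toy example, predictions that stay close to the scale midpoint are rarely far from any individual answer, so accuracy is high, yet they track little of the variation between participants, so correlation is low and the twin responses are less dispersed than the human ones.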
