Next Reply Prediction X Dataset: Linguistic Discrepancies in Naively Generated Content
Simon Münker, Nils Schwager, Kai Kugler, Michael Heseltine, Achim Rettinger
TL;DR
This paper investigates whether Large Language Models can reliably mimic authentic social-media behavior. It introduces a history-conditioned reply-prediction dataset in English and German to quantify linguistic discrepancies between human and synthetic content. It develops a multi-dimensional evaluation framework (covering quantitative features, morphosyntactic patterns, semantic classifications, cluster-based similarity, and feature-vector distances) and compares generations from base and fine-tuned models. Results show that fine-tuned outputs align more closely with human content than prompt-based ones, yet detectable signatures persist, which calls the ecological validity of such simulations into question. The study advocates mandatory validation protocols, domain-specific fine-tuning, and multi-dimensional evaluation to improve the realism and reliability of LLM-based social simulations in computational social science.
Abstract
The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising, yet methodologically risky, paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, in which they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel, history-conditioned reply prediction task on authentic X (formerly Twitter) data, yielding a dataset for evaluating the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.
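To make the idea of comparing human and synthetic replies with stylistic metrics concrete, the following is a minimal sketch, not the evaluation framework used in this paper. The feature set (character length, token count, mean token length, type-token ratio, hashtag and mention counts), the function names, and the choice of Euclidean distance are all illustrative assumptions.

```python
import math
import re
from statistics import mean


def stylistic_features(text: str) -> list[float]:
    """Map one reply to a small vector of surface-level features.

    The features are an illustrative assumption, not the paper's set.
    """
    tokens = text.split()
    n_tokens = len(tokens)
    return [
        float(len(text)),                                    # length in characters
        float(n_tokens),                                     # whitespace token count
        mean(len(t) for t in tokens) if tokens else 0.0,     # mean token length
        len({t.lower() for t in tokens}) / n_tokens if tokens else 0.0,  # type-token ratio
        float(len(re.findall(r"#\w+", text))),               # hashtag count
        float(len(re.findall(r"@\w+", text))),               # mention count
    ]


def mean_vector(texts: list[str]) -> list[float]:
    """Average the feature vectors of a non-empty collection of replies."""
    rows = [stylistic_features(t) for t in texts]
    return [mean(col) for col in zip(*rows)]


def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Toy, made-up replies purely for demonstration.
    human = ["lol no way @sam", "ok but why tho #confused"]
    synthetic = [
        "I appreciate your perspective, but I respectfully disagree.",
        "That is an interesting point that deserves further discussion.",
    ]
    gap = euclidean(mean_vector(human), mean_vector(synthetic))
    print(f"human vs. synthetic feature distance: {gap:.2f}")
```

A larger distance between the two mean vectors indicates a larger surface-level gap between the two sets of replies. In practice the features would need scaling before distances are compared, since raw character length dominates the other dimensions here.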
