Computational Turing Test Reveals Systematic Differences Between Human and AI Language

Nicolò Pagan, Petter Törnberg, Christopher A. Bail, Anikó Hannák, Christopher Barrie

TL;DR

This work tackles the validation of realism in large language model (LLM) outputs for social science simulations, addressing the limits of human judgments as validation. It introduces a scalable computational Turing test that jointly assesses detectability, semantic fidelity, and interpretable linguistic features, and performs a comprehensive calibration benchmark across nine open-weight LLMs, three social platforms, and multiple optimization strategies. Key findings show AI-generated text remains distinguishable (70–80% accuracy) even after calibration, with affective language and platform-specific cues persisting as reliable discriminators, while some optimization strategies reduce detectability yet degrade semantic fidelity. The results provide a practical framework for validation and calibration of LLM-based social simulations and highlight fundamental trade-offs between making text sound human and preserving its meaning, with broad implications for both research and application.

Abstract

Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.
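The detectability metric described in the abstract can be read as classifier accuracy relative to the 50% chance level: the closer a detector's accuracy is to chance, the more human-like the generated text. The snippet below is a minimal sketch of that reading, not the authors' pipeline. The function names and the toy labels are hypothetical, and the predictions stand in for the output of a trained BERT-based classifier.

```python
from typing import Sequence


def detection_accuracy(true_labels: Sequence[int], predicted_labels: Sequence[int]) -> float:
    """Fraction of texts whose origin (0 = human, 1 = AI) the detector labels correctly."""
    if len(true_labels) != len(predicted_labels) or len(true_labels) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def margin_over_chance(accuracy: float, chance: float = 0.5) -> float:
    """How far the detector sits above chance; 0.0 would mean indistinguishable text."""
    return accuracy - chance


# Toy data (hypothetical): four human texts followed by four AI texts.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical detector predictions; two of the eight are wrong.
predicted_labels = [0, 0, 0, 1, 1, 1, 1, 0]

accuracy = detection_accuracy(true_labels, predicted_labels)
print(f"accuracy = {accuracy:.2f}, margin over chance = {margin_over_chance(accuracy):.2f}")
```

With a balanced sample like this one, an accuracy in the 70–80% range reported in the TL;DR corresponds to a margin of 0.20 to 0.30 above chance.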

Paper Structure

This paper contains 23 sections, 11 figures.

Figures (11)

  • Figure 1: BERT classifier accuracy for optimal model configurations across datasets and optimization strategies. Lower accuracy indicates better human mimicry (approaching the 50% chance level). Non-instruction-tuned models from Llama, Mistral, and Apertus are harder to detect than their instruction-tuned variants. Of the three social media platforms examined, Twitter/X content (black bars) is the one LLMs replicate most successfully.
  • Figure 2: Distribution of content similarity between the AI-generated response and the ground-truth response, measured via the cosine similarity score using the all-MiniLM-L6-v2 model from the SentenceTransformers library.
  • Figure 3: Results of the feature importance analysis derived from the Random Forest model trained to classify AI-generated vs Human text. The top $10$ most important features are displayed for each dataset and model. For each dataset, features are ordered left to right according to decreasing importance across all models.
  • Figure 4: Empath library analysis (Fast et al., 2016). For each dataset, the plot shows up to $20$ of the most common features (across all nine models) that differ statistically significantly.
  • Figure 5: BERT classifier accuracy comparing the state-of-the-art method (Baseline + Persona) with the best-performing configurations across models and datasets. Increasing configuration complexity makes the models less detectable and substantially reduces the differences between models. The best-performing configuration is either the one that adds stylistic examples and context to the persona description, or the one that additionally uses a fine-tuned model (especially for instruction-tuned models).
  • ...and 6 more figures
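
The content-similarity score in Figure 2 is a cosine similarity between the embedding of the AI-generated response and that of the ground-truth response. The following is a minimal sketch of that computation in plain Python. The short vectors are made-up stand-ins for sentence embeddings (the paper obtains them from all-MiniLM-L6-v2), and the helper name is an assumption for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings"; real sentence embeddings are much larger.
human_reply = [1.0, 0.0, 1.0]
ai_reply_close = [2.0, 0.0, 2.0]  # same direction as the human reply
ai_reply_off = [0.0, 1.0, 0.0]    # orthogonal to the human reply

print(cosine_similarity(human_reply, ai_reply_close))
print(cosine_similarity(human_reply, ai_reply_off))
```

Because cosine similarity ignores vector length, the first pair scores close to 1 even though the vectors differ in magnitude, while the orthogonal pair scores 0.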