Identifying Non-Replicable Social Science Studies with Language Models
Denitsa Saynova, Kajsa Hansson, Bastiaan Bruinsma, Annika Fredén, Moa Johansson
TL;DR
This paper investigates whether instruction-tuned LLMs can signal the replicability of social science findings. For each of 14 Many Labs 2 experiments, the models generate 1000 synthetic responses, and the resulting predicted replication labels are compared with the human replication results. Open-source models (Llama 3 8B, Qwen 2 7B, Mistral 7B) match or exceed GPT-4o in predictive performance, reaching up to $77\%$ F1 in some configurations. Higher sampling temperatures generally reduce bias but increase variance, whereas low-variance samples can inflate effect-size estimates ($|d| > 4.0$ in several cases), so temperature tuning is critical for reliable inference. Overall, LLMs show promise as a cost-effective screen for prioritizing replication efforts in social science, provided their biases and limitations are handled carefully before practical deployment.
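The effect-size inflation mentioned above can be illustrated with a minimal sketch. It assumes the effect size is Cohen's $d$ with a pooled standard deviation, which is a common choice but is not confirmed by this summary; the response values are hypothetical and chosen only so that both pairs of groups share the same mean difference.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed estimator)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("d is undefined when both groups have zero variance")
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical responses: both pairs differ by 0.5 in their means.
wide_a = [2, 3, 4, 5, 6]                 # spread-out responses
wide_b = [1.5, 2.5, 3.5, 4.5, 5.5]
narrow_a = [3.9, 4.0, 4.0, 4.0, 4.1]     # near-identical responses
narrow_b = [3.4, 3.5, 3.5, 3.5, 3.6]

d_wide = cohens_d(wide_a, wide_b)
d_narrow = cohens_d(narrow_a, narrow_b)
print(f"wide spread:   d = {d_wide:.2f}")
print(f"narrow spread: d = {d_narrow:.2f}")
```

Because the mean difference is divided by the pooled standard deviation, shrinking the spread of the responses while holding the means fixed makes $|d|$ grow without any change in the underlying difference.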
Abstract
In this study, we investigate whether LLMs can be used to indicate if a study in the behavioural social sciences is replicable. Using a dataset of 14 previously replicated studies (9 successful, 5 unsuccessful), we evaluate the ability of both open-source (Llama 3 8B, Qwen 2 7B, Mistral 7B) and proprietary (GPT-4o) instruction-tuned LLMs to discriminate between replicable and non-replicable findings. We use LLMs to generate synthetic samples of responses from behavioural studies and estimate whether the measured effects support the original findings. When compared with human replication results for these studies, we achieve F1 values of up to $77\%$ with Mistral 7B, $67\%$ with GPT-4o and Llama 3 8B, and $55\%$ with Qwen 2 7B, suggesting that these models have potential for this task. We also analyse how effect size calculations are affected by sampling temperature and find that low response variance, which arises at low sampling temperatures, leads to biased effect estimates.
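As a rough illustration of the evaluation metric, the sketch below computes F1 from binary replication labels. The abstract does not state which class is treated as positive or whether scores are averaged across classes, so the `positive` parameter is an assumption, and the labels are hypothetical toy values rather than results from the study.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class; `positive` selects the class treated as the target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = replicated, 0 = did not replicate.
human_labels = [1, 1, 1, 0, 0, 0]
model_labels = [1, 1, 1, 1, 0, 0]

f1_replicable = f1_score(human_labels, model_labels, positive=1)
f1_non_replicable = f1_score(human_labels, model_labels, positive=0)
print(f"F1 (replicable as positive):     {f1_replicable:.3f}")
print(f"F1 (non-replicable as positive): {f1_non_replicable:.3f}")
```

The two scores differ on the same predictions, which is why the choice of positive class matters when comparing F1 values on a small, imbalanced set of studies.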
