Identifying Non-Replicable Social Science Studies with Language Models

Denitsa Saynova, Kajsa Hansson, Bastiaan Bruinsma, Annika Fredén, Moa Johansson

TL;DR

This paper investigates whether instruction-tuned LLMs can signal the replicability of social science findings by generating 1000 synthetic responses per study from 14 Many Labs 2 experiments and comparing predicted replication labels to human results. It demonstrates that open-source models (Llama 3 8B, Qwen 2 7B, Mistral 7B) can match or exceed GPT-4o in predictive accuracy, achieving up to $77\%$ F1 in some configurations, with higher temperatures generally reducing bias but increasing variance. The work highlights that low-variance samples can inflate effect-size estimates ($|d| > 4.0$ in several cases) and that temperature tuning is critical for reliable inferences. Overall, LLMs show promise as a cost-effective screen to prioritize replication efforts in social science, though careful handling of biases and limitations is necessary before deployment in practice.

Abstract

In this study, we investigate whether LLMs can be used to indicate if a study in the behavioural social sciences is replicable. Using a dataset of 14 previously replicated studies (9 successful, 5 unsuccessful), we evaluate the ability of both open-source (Llama 3 8B, Qwen 2 7B, Mistral 7B) and proprietary (GPT-4o) instruction-tuned LLMs to discriminate between replicable and non-replicable findings. We use LLMs to generate synthetic samples of responses from behavioural studies and estimate whether the measured effects support the original findings. When compared with human replication results for these studies, we achieve F1 values of up to $77\%$ with Mistral 7B, $67\%$ with GPT-4o and Llama 3 8B, and $55\%$ with Qwen 2 7B, suggesting their potential for this task. We also analyse how effect size calculations are affected by sampling temperature and find that low variance (due to temperature) leads to biased effect estimates.
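
The link between low sample variance and inflated effect sizes can be illustrated with a minimal sketch. It assumes that the effect size $d$ is Cohen's $d$ with a pooled standard deviation, and it stands in for LLM responses with Gaussian draws whose spread plays the role of sampling temperature. The function names (`cohens_d`, `simulate`), the group means, and the spread values are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import math
import random
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation.

    Returns None when both groups have zero variance, since the
    effect size is undefined in that case.
    """
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    if pooled_var == 0:
        return None
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_diff / math.sqrt(pooled_var)


def simulate(spread, n=1000, seed=0):
    """Draw two synthetic response groups with a fixed mean difference.

    Assumption: `spread` is a stand-in for the response variability
    that a sampling temperature would induce; the means 4.0 and 4.5
    are arbitrary illustrative values.
    """
    rng = random.Random(seed)
    control = [rng.gauss(4.0, spread) for _ in range(n)]
    treatment = [rng.gauss(4.5, spread) for _ in range(n)]
    return cohens_d(treatment, control)


# Same mean difference, different spread: the narrow sample yields
# a much larger standardized effect.
d_wide = simulate(spread=1.0)
d_narrow = simulate(spread=0.1)
print(f"wide spread:   d = {d_wide:.2f}")
print(f"narrow spread: d = {d_narrow:.2f}")
```

Because the mean difference is divided by the pooled standard deviation, shrinking the spread while holding the mean difference fixed scales $d$ up roughly in proportion, which is one way near-deterministic sampling could produce very large effect estimates.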

Paper Structure

This paper contains 15 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Illustration of a successful LLM replication (after Tversky1981a)
  • Figure 2: Histograms of answer distributions for open-source models compared to human sample and GPT-3.5 sample park2024diminished
  • Figure 3: Effect sizes for the studies successfully replicated by Many Labs. The left y-axis indicates the different temperatures used. A $*$ indicates that no effect could be calculated due to a correct response effect, while a $\dagger$ indicates that no effect could be calculated because the LLM did not produce useful data.
  • Figure 4: Effect sizes for the studies unsuccessfully replicated by Many Labs. The left y-axis indicates the different temperatures used. A $*$ indicates that no effect could be calculated due to a correct answer effect, while a $\dagger$ indicates that no effect could be calculated because the LLM did not produce useful data.
  • ...and 10 more figures