Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

Joachim Baumann; Paul Röttger; Aleksandra Urman; Albert Wendsjö; Flor Miriam Plaza-del-Arco; Johannes B. Gruber; Dirk Hovy

TL;DR

The paper identifies a systemic vulnerability in using large language models (LLMs) for text annotation: configuration choices across models, prompts, decoding, and mappings can drive incorrect scientific conclusions, a phenomenon the authors call LLM hacking. In a large-scale replication of 37 computational social science (CSS) tasks covering 2,361 hypotheses and 18 LLMs, the study shows that intentional manipulation yields near-certain false positives or missed true effects, and that accidental hacking remains common even under reasonable practices. The authors quantify the risk across tasks, identify its key predictors (notably proximity to significance thresholds and task characteristics), and show that human annotations and bias-corrected estimators can mitigate it, at the cost of trading off Type I against Type II errors. They advocate rigorous validation, pre-registration of configurations, and human-in-the-loop or multiverse approaches to preserve scientific validity when LLMs are used for annotation in social science and related fields.

Abstract

Large language models are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection or prompting strategy). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We call this phenomenon where configuration choices lead to incorrect conclusions LLM hacking. We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant. Beyond intentional manipulation, our analysis of 13 million labels from 18 different LLMs across 2361 realistic hypotheses shows that there is also a high risk of accidental LLM hacking, even when following standard research practices. We find incorrect conclusions in approximately 31% of hypotheses for state-of-the-art LLMs, and in half the hypotheses for smaller language models. While higher task performance and stronger general model capabilities reduce LLM hacking risk, even highly accurate models remain susceptible. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of LLM-based findings near significance thresholds. We analyze 21 mitigation techniques and find that human annotations provide crucial protection against false positives. Common regression estimator correction techniques can restore valid inference but trade off Type I vs. Type II errors. We publish a list of practical recommendations to prevent LLM hacking.
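
The four error types named in the abstract can be made concrete with a small sketch. It compares a conclusion drawn from reference (for example, human) annotations with the conclusion drawn from LLM annotations for the same hypothesis. This is an illustration under stated assumptions, not the authors' implementation: the `Estimate` class, the `classify_error` function, the 0.05 significance level, and the `m_ratio` cutoff for an "exaggerated" effect are all hypothetical choices made here for clarity.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """A regression coefficient and its p-value (hypothetical container)."""
    coef: float
    p_value: float


def classify_error(truth: Estimate, llm: Estimate,
                   alpha: float = 0.05, m_ratio: float = 2.0) -> str:
    """Label the LLM-based conclusion relative to the reference conclusion.

    alpha and m_ratio are illustrative assumptions, not values from the paper.
    """
    truth_sig = truth.p_value < alpha
    llm_sig = llm.p_value < alpha

    if llm_sig and not truth_sig:
        return "Type I"   # false positive: effect found only with LLM labels
    if truth_sig and not llm_sig:
        return "Type II"  # false negative: true effect missed with LLM labels
    if truth_sig and llm_sig:
        if truth.coef * llm.coef < 0:
            return "Type S"  # both significant, but the sign is flipped
        if abs(llm.coef) > m_ratio * abs(truth.coef):
            return "Type M"  # both significant, magnitude exaggerated
    return "correct"


# One hypothesis evaluated under several LLM configurations (made-up numbers).
reference = Estimate(coef=0.20, p_value=0.01)
configs = {
    "config_a": Estimate(coef=0.22, p_value=0.02),
    "config_b": Estimate(coef=0.05, p_value=0.40),
    "config_c": Estimate(coef=-0.30, p_value=0.01),
    "config_d": Estimate(coef=0.55, p_value=0.001),
}
labels = {name: classify_error(reference, est) for name, est in configs.items()}
error_rate = sum(label != "correct" for label in labels.values()) / len(labels)
```

Under this reading, the share of configurations that end in anything other than "correct" is one simple way to summarize how sensitive a hypothesis is to configuration choices; the paper's own risk measures may be defined differently.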
