Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

Joachim Baumann, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del-Arco, Johannes B. Gruber, Dirk Hovy

TL;DR

The paper reveals a systemic vulnerability in using large language models for text annotation: configuration choices across models, prompts, decoding settings, and label mappings can drive incorrect scientific conclusions (LLM hacking). Through a large-scale replication of 37 computational social science (CSS) annotation tasks and 2,361 hypotheses across 18 LLMs, the study shows that intentional manipulation can produce false positives or hide true effects almost at will, and that accidental hacking remains common even under reasonable practices. The authors quantify risk across tasks, identify key predictors (notably proximity to significance thresholds and task characteristics), and demonstrate that human annotations and bias-corrected estimators can mitigate risk, albeit with trade-offs between Type I and Type II errors. They advocate rigorous validation, pre-registration of configurations, and human-in-the-loop or multiverse approaches to preserve scientific validity when using LLMs for annotation. Together, these results provide actionable guidelines and a cautionary framework for deploying LLM-based annotation in social science and related fields.

Abstract

Large language models are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection or prompting strategy). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors. We call this phenomenon, in which configuration choices lead to incorrect conclusions, LLM hacking. We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant. Beyond intentional manipulation, our analysis of 13 million labels from 18 different LLMs across 2,361 realistic hypotheses shows that there is also a high risk of accidental LLM hacking, even when following standard research practices. We find incorrect conclusions in approximately 31% of hypotheses for state-of-the-art LLMs, and in half the hypotheses for smaller language models. While higher task performance and stronger general model capabilities reduce LLM hacking risk, even highly accurate models remain susceptible. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of LLM-based findings near significance thresholds. We analyze 21 mitigation techniques and find that human annotations provide crucial protection against false positives. Common regression estimator correction techniques can restore valid inference but trade off Type I vs. Type II errors. We publish a list of practical recommendations to prevent LLM hacking.
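
The four error types named in the abstract can be made concrete with a small sketch. It assumes that each hypothesis is summarized by a regression coefficient and a p-value, estimated once from ground-truth annotations and once from LLM annotations, and that significance is judged at a threshold of 0.05. The names `Estimate`, `classify_conclusion`, and `relative_magnitude_error` are illustrative and do not come from the paper; treating Type M as a separate magnitude measure rather than a category is likewise one plausible reading, not the authors' stated procedure.

```python
from dataclasses import dataclass

ALPHA = 0.05  # assumed significance threshold


@dataclass
class Estimate:
    coef: float     # estimated regression coefficient
    p_value: float  # p-value for the coefficient


def classify_conclusion(truth: Estimate, llm: Estimate, alpha: float = ALPHA) -> str:
    """Compare the LLM-based conclusion with the ground-truth conclusion."""
    truth_sig = truth.p_value < alpha
    llm_sig = llm.p_value < alpha
    if not truth_sig and llm_sig:
        return "type_I"   # false positive: effect reported where none exists
    if truth_sig and not llm_sig:
        return "type_II"  # false negative: true effect missed
    if truth_sig and llm_sig and (truth.coef > 0) != (llm.coef > 0):
        return "type_S"   # both significant, but opposite signs
    return "correct"      # same significance decision (and same sign if significant)


def relative_magnitude_error(truth: Estimate, llm: Estimate) -> float:
    """Relative deviation of the LLM-based effect size from the true one (Type M)."""
    return abs(llm.coef - truth.coef) / abs(truth.coef)


if __name__ == "__main__":
    truth = Estimate(coef=0.02, p_value=0.40)  # no true effect
    llm = Estimate(coef=0.15, p_value=0.01)    # LLM labels suggest an effect
    print(classify_conclusion(truth, llm))
```

Under these assumptions, a single hypothesis can receive different labels depending on which LLM configuration produced the annotations, which is the variation the paper measures.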

Paper Structure

This paper contains 148 sections, 19 equations, 22 figures, 11 tables.

Figures (22)

  • Figure 1: (left) We quantify LLM hacking risk through systematic replication of 37 diverse computational social science annotation tasks. For these tasks, we create a combined set of 2,361 realistic hypotheses that researchers might test using these annotations. (middle) We collect 13 million LLM annotations across plausible LLM configurations. These annotations feed into 1.4 million regressions testing the hypotheses. (right) For a hypothesis with no true effect (ground truth $p > 0.05$), different LLM configurations yield conflicting conclusions. Checkmarks indicate correct statistical conclusions matching ground truth; crosses indicate LLM hacking -- incorrect conclusions due to LLM annotation errors. Across all experiments, LLM hacking occurs in 31-50% of cases even with highly capable models. Since minor configuration changes can flip scientific conclusions, from correct to incorrect, LLM hacking can be exploited to present virtually anything as statistically significant.
  • Figure 2: Average feasibility rates of LLM hacking (red bars) and correct conclusions (green bars) across annotation tasks. Red bars show the proportion of hypotheses where at least one configuration yields an incorrect conclusion (lower is better, indicating more robust results). Green bars show the proportion where at least one configuration yields the correct conclusion (higher is better, indicating greater potential for accurate conclusions). The top panel shows feasibility rates for hypotheses without significant differences, while the bottom panel shows feasibility rates for hypotheses with significant differences. The analysis is restricted to the top models: Llama-3.1-70B, Qwen2.5-32B, Qwen2.5-72B, Qwen3-32B, Gemma-3-27b, GPT-4o-mini, and GPT-4o.
  • Figure 3: Scaling relationships for LLM hacking risk and annotation performance. Left panel shows task-averaged LLM hacking risk decreasing with model size across all model families, with larger models consistently outperforming smaller ones. Right panel shows corresponding improvements in weighted F1 annotation performance. Both metrics exhibit clear scaling trends, though substantial risk remains even for the largest models.
  • Figure 4: Average weighted F1 scores and LLM hacking risk across all 37 annotation tasks (sorted by decreasing LLM hacking risk). Error bars show 95% confidence intervals.
  • Figure 5: Type M error analysis for seven top models. Relative magnitude errors indicate the extent to which effect sizes estimated from LLM annotations deviate from true effect sizes in the absence of LLM hacking. Left: The histogram shows the distribution of Type M errors. Right: The cumulative probability curve shows the probability of an error at or below the level indicated on the x-axis.
  • ...and 17 more figures

Theorems & Definitions (3)

  • Definition 1: LLM Hacking
  • Definition 2: LLM Hacking Risk
  • Definition 3: LLM Hacking Feasibility
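
The definitions themselves are not reproduced in this section, so the following is only a sketch of how the two quantities might be computed. Feasibility follows the Figure 2 caption (the proportion of hypotheses where at least one configuration yields an incorrect, or respectively a correct, conclusion). Risk is assumed here to be the share of configurations that yield an incorrect conclusion, averaged over hypotheses; that reading, the function names, and the toy data are assumptions rather than the paper's formal definitions.

```python
from typing import Dict, List

# Maps a hypothesis id to one flag per LLM configuration:
# True means that configuration reached the correct conclusion.
Outcomes = Dict[str, List[bool]]


def hacking_risk(outcomes: Outcomes) -> float:
    """Assumed: mean share of configurations with an incorrect conclusion."""
    per_hypothesis = [1 - sum(flags) / len(flags) for flags in outcomes.values()]
    return sum(per_hypothesis) / len(per_hypothesis)


def hacking_feasibility(outcomes: Outcomes) -> float:
    """Share of hypotheses where at least one configuration is incorrect."""
    hits = sum(any(not ok for ok in flags) for flags in outcomes.values())
    return hits / len(outcomes)


def correct_feasibility(outcomes: Outcomes) -> float:
    """Share of hypotheses where at least one configuration is correct."""
    hits = sum(any(flags) for flags in outcomes.values())
    return hits / len(outcomes)


if __name__ == "__main__":
    toy = {
        "h1": [True, True, False, True],      # one configuration goes wrong
        "h2": [True, True, True, True],       # robust to configuration choice
        "h3": [False, False, False, False],   # no configuration is correct
    }
    print(hacking_risk(toy), hacking_feasibility(toy), correct_feasibility(toy))
```

The distinction matters for interpretation: under this reading, risk describes what a researcher who picks a configuration without searching can expect, while feasibility describes what is attainable by searching over configurations.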