Can Unconfident LLM Annotations Be Used for Confident Conclusions?

Kristina Gligorić; Tijana Zrnic; Cinoo Lee; Emmanuel J. Candès; Dan Jurafsky

TL;DR

This paper tackles the challenge of deriving valid statistical inferences in NLP when annotations are partially provided by large language models (LLMs). It introduces Confidence-Driven Inference, an adaptive framework that uses LLM annotations and calibrated verbalized confidence to guide selective human annotation under a budget constraint, producing an unbiased estimator $\hat{\theta}^{\mathrm{conf}}$ and a valid confidence interval at level $1-\alpha$. By training a per-instance error predictor from LLM confidence and sampling accordingly, the method achieves substantial gains in effective sample size while maintaining coverage, outperforming non-adaptive and human-only baselines across five targets in politeness, stance, and political bias. The approach is model-free and broadly applicable to standard NLP estimation tasks, enabling cost-effective yet statistically valid inferences in computational social science and beyond.

Abstract

Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings--text politeness, stance, and bias--reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.
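
The idea of using LLM confidence to decide which human annotations to collect can be illustrated with a minimal sketch for estimating a mean. This is not the paper's implementation: the function name `confidence_driven_mean`, the use of `1 - confidence` as a stand-in for the trained error predictor, the 50/50 mix with uniform sampling, and the normal-approximation interval are all assumptions made for illustration. The sketch corrects each LLM annotation with an inverse-probability-weighted human annotation, which keeps the estimate unbiased regardless of LLM quality.

```python
import math
import random
from statistics import NormalDist


def confidence_driven_mean(llm_labels, confidences, human_oracle, budget,
                           alpha=0.1, seed=0):
    """Illustrative inverse-probability-weighted mean estimate (hypothetical helper).

    llm_labels   : numeric LLM annotation for each instance
    confidences  : LLM confidence in [0, 1] for each instance
    human_oracle : callable i -> human annotation, called only when sampled
    budget       : expected number of human annotations to collect
    """
    rng = random.Random(seed)
    n = len(llm_labels)

    # Assumption: lower confidence means a larger expected LLM error.
    uncertainty = [1.0 - c for c in confidences]
    total = sum(uncertainty)
    uniform = budget / n

    probs = []
    for u in uncertainty:
        adaptive = budget * u / total if total > 0 else uniform
        # Mix with uniform sampling so every probability stays above zero.
        p = 0.5 * uniform + 0.5 * adaptive
        probs.append(min(1.0, max(p, 1e-6)))

    terms, n_sampled = [], 0
    for i in range(n):
        correction = 0.0
        if rng.random() < probs[i]:
            n_sampled += 1
            # Reweighting by 1 / p_i makes each term unbiased for the human label.
            correction = (human_oracle(i) - llm_labels[i]) / probs[i]
        terms.append(llm_labels[i] + correction)

    estimate = sum(terms) / n
    variance = sum((t - estimate) ** 2 for t in terms) / (n - 1)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(variance / n)
    return estimate, (estimate - half_width, estimate + half_width), n_sampled
```

When the LLM annotations agree with the human annotations, the correction terms vanish and the interval is as tight as if every instance had been human-labeled; when they disagree, the corrections widen the interval rather than biasing the estimate.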

Paper Structure (34 sections, 10 equations, 4 figures, 5 tables)


Introduction
Background
LLMs for Data Annotation Tasks
Collaborative Annotation Paradigms
Valid Statistical Inferences in NLP
Methods
Problem Setup
Confidence-Driven Inference
Baselines
Human + LLM (non-adaptive).
Human only.
LLM only.
Evaluation Metrics
Effective sample size.
Coverage.
...and 19 more sections

Figures (4)

Figure 1: Illustration of Confidence-Driven Inference. Given a text corpus and a quantity of interest $\theta^*$, (1) we collect LLM annotations and indicators of LLM confidence, based on which we strategically choose a small number of human annotations; (2) we then produce an unbiased estimate $\hat{\theta}^{\mathrm{conf}}$ and a valid confidence interval, allowing valid downstream conclusions.
Figure 2: Confidence intervals, effective sample size, and coverage. Rows correspond to different estimation tasks. The first column shows the confidence intervals in five random trials. The vertical dashed line corresponds to the estimate produced on the full dataset. A method is valid if its confidence interval includes this estimate (in about 90% of the trials), and tighter intervals around $\theta^*$ indicate better performance. The second and third columns display the effective sample size $n_{\mathrm{effective}}$ and coverage, respectively, for different values of the human annotation budget $n_{\mathrm{human}}$. Results are estimated over 100 trials.
Figure 3: Histograms and calibration curves of verbalized confidence scores. (Left) Confidence score histograms across the three settings (GPT-4o). (Right) LLM annotation accuracy with respect to human annotations (y-axis), among instances where the confidence score is greater than C (x-axis) across the three settings (GPT-4o).
Figure 4: Confidence intervals, effective sample size, and coverage (GPT-3.5). Rows correspond to different estimation tasks. The first column shows the confidence intervals in five random trials. The vertical dashed line corresponds to the estimate produced on the full dataset. A method is valid if its confidence interval includes this estimate (in about 90% of the trials), and tighter intervals around $\theta^*$ indicate better performance. The second and third columns display the effective sample size $n_{\mathrm{effective}}$ and coverage, respectively, for different values of the human annotation budget $n_{\mathrm{human}}$. Results are estimated over 100 trials.
