Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks
Jiaman He, Zikang Leng, Dana McKay, Damiano Spina, Johanne R. Trippas
TL;DR
The paper asks whether LLMs can mimic human judgment in multi-annotator text labeling tasks. It introduces a substitution-based evaluation framework that combines Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided Tests (TOST) procedure to test equivalence between LLM and human annotators. Applied to MovieLens 100K and PolitiFact, the method finds the LLM statistically indistinguishable from a human annotator on MovieLens ($p=0.004$) but not on PolitiFact ($p=0.155$), which highlights task dependence. The work provides a practical toolkit for early, small-sample evaluation and releases a dataset of LLM and human annotations to support ongoing research into scalable, human-aligned annotation processes.
Abstract
Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated "ground truth" using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we treat them as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. This evaluation method tests whether an LLM can blend into a group of human annotators without being distinguishable. We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former ($p = 0.004$), but not in the latter ($p = 0.155$), highlighting task-dependent differences. The method also enables early evaluation on a small sample of human data to inform whether LLMs are suitable for large-scale annotation in a given application.
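The abstract names three ingredients (Krippendorff's $\alpha$, paired bootstrapping, and an equivalence test) without spelling out how they fit together. The sketch below is one plausible reading, not the authors' implementation. It assumes nominal labels with no missing ratings, replaces a single human annotator with the LLM, resamples items with the same indices for both groups, and checks whether the bootstrap interval of the change in $\alpha$ stays inside a margin. The function names, the margin of 0.05, and the toy data are all hypothetical. The final check is the confidence-interval form of an equivalence test, swapped in for the t-test-based TOST the paper names so that the example needs only the standard library.

```python
import math
import random
from collections import Counter


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (assumed: no missing ratings).

    units: list of items, each a list with one label per annotator.
    Returns NaN when only one label value occurs (alpha is undefined).
    """
    totals = Counter()   # n_c: how often each label occurs overall
    disagreement = 0.0   # sum of off-diagonal coincidence-matrix cells
    for labels in units:
        m = len(labels)
        if m < 2:        # a single rating cannot be paired
            continue
        counts = Counter(labels)
        totals.update(counts)
        matching_pairs = sum(v * (v - 1) for v in counts.values())
        disagreement += (m * (m - 1) - matching_pairs) / (m - 1)
    n = sum(totals.values())
    expected = n * n - sum(v * v for v in totals.values())
    if expected == 0:
        return math.nan
    return 1.0 - (n - 1) * disagreement / expected


def substitute(human, llm, seat):
    """Return a copy of the human label matrix with the LLM in one seat."""
    return [row[:seat] + [label] + row[seat + 1:]
            for row, label in zip(human, llm)]


def paired_bootstrap_diffs(human, llm, seat, n_boot=1000, seed=0):
    """Bootstrap alpha(mixed group) - alpha(human group) over shared resamples."""
    rng = random.Random(seed)
    mixed = substitute(human, llm, seat)
    n_items = len(human)
    diffs = []
    for _ in range(n_boot):
        # Paired: both groups are scored on the same resampled items.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        a_human = krippendorff_alpha_nominal([human[i] for i in idx])
        a_mixed = krippendorff_alpha_nominal([mixed[i] for i in idx])
        if not (math.isnan(a_human) or math.isnan(a_mixed)):
            diffs.append(a_mixed - a_human)
    return diffs


def within_margin(diffs, margin=0.05, level=0.05):
    """Equivalence check: is the (1 - 2*level) percentile interval inside +/- margin?

    margin is a hypothetical choice; the paper's margin is not given here.
    """
    ordered = sorted(diffs)
    lo = ordered[int(level * len(ordered))]
    hi = ordered[int((1 - level) * len(ordered)) - 1]
    return (-margin < lo and hi < margin), (lo, hi)


if __name__ == "__main__":
    # Toy data (hypothetical): 6 items, 3 human annotators, binary labels.
    human = [[1, 1, 2], [2, 2, 2], [1, 2, 1], [2, 2, 1], [1, 1, 1], [2, 1, 2]]
    llm = [1, 2, 1, 2, 1, 2]  # one LLM label per item
    diffs = paired_bootstrap_diffs(human, llm, seat=2)
    print(within_margin(diffs))
```

Under this reading, a small $p$-value such as the one reported for MovieLens corresponds to the interval falling inside the margin, while a large one, as for PolitiFact, means equivalence could not be established at that margin. With real data, the substitution would typically be repeated for each human seat, and an ordinal or interval distance may suit rating scales better than the nominal one assumed here.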
