Incentivizing High-Quality Human Annotations with Golden Questions
Shang Liu, Zhongze Cai, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li
TL;DR
This work studies how to incentivize high-quality human annotations for LLM alignment by modeling annotator behavior in a principal-agent framework. It introduces golden questions and an MLE-based monitoring test to infer an unobserved commitment parameter $\theta_a$, deriving a minimax rate $\mathrm{Var}(\Psi)=\Theta(1/\sqrt{n \log n})$ that reflects the strategic nature of annotators. The authors propose two criteria for selecting effective golden questions: high certainty and a format similar to that of regular items. They validate these criteria through real data collection and reward-model-driven experiments, showing that golden questions distinguish high-quality annotators better than instructed manipulation checks. The results provide a principled basis for designing incentive schemes and monitoring data quality in large-scale annotation pipelines, with implications for reward modeling and post-training alignment workflows.
Abstract
Human-annotated data plays a vital role in training large language models (LLMs), for example in supervised fine-tuning and human preference alignment. However, paid human annotators are not guaranteed to produce high-quality data. In this paper, we study how to incentivize human annotators to do so. We start from a principal-agent model that captures the dynamics between the company (the principal) and the annotator (the agent), where the principal can only monitor the annotation quality by examining $n$ samples. We investigate the maximum likelihood estimators (MLE) and the corresponding hypothesis testing to incentivize annotators: the agent is given a bonus if the MLE passes the test. By analyzing the variance of the outcome, we show that the strategic behavior of the agent makes the hypothesis testing very different from traditional ones: unlike the exponential rate established by large deviation theory, the hypothesis testing rate in the principal-agent model is $\Theta(1/\sqrt{n \log n})$. Our theory implies two criteria for the golden questions used to monitor the performance of the annotators: they should (1) be of high certainty and (2) have a format similar to that of normal questions. In that light, we select a set of golden questions in human preference data. Through incentive-compatible experiments, we find that the annotators' behavior is better revealed by those golden questions than by traditional survey techniques such as instructed manipulation checks.
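
The monitoring scheme described above (estimate quality from $n$ examined samples, pay a bonus if the estimate passes a test) can be illustrated with a minimal sketch. It is not the authors' model: it assumes, purely for illustration, that each golden question is answered correctly with a fixed probability `accuracy`, so the MLE is the sample mean, and that the test is a simple threshold. The names `accuracy`, `threshold`, and `pass_rate` are hypothetical, and the sketch ignores the strategic behavior that drives the paper's $\Theta(1/\sqrt{n \log n})$ rate.

```python
import random


def simulate_golden_answers(accuracy, n, rng):
    """Draw n Bernoulli outcomes: 1 if the annotator matches the golden label.

    Assumption: answers are i.i.d. with a fixed success probability.
    """
    return [1 if rng.random() < accuracy else 0 for _ in range(n)]


def mle_accuracy(outcomes):
    """MLE of a Bernoulli success probability, i.e. the sample mean."""
    return sum(outcomes) / len(outcomes)


def passes_test(outcomes, threshold):
    """Hypothetical threshold test: the bonus is paid when the MLE >= threshold."""
    return mle_accuracy(outcomes) >= threshold


def pass_rate(accuracy, n, threshold, trials=2000, seed=0):
    """Monte Carlo estimate of how often an annotator earns the bonus."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        outcomes = simulate_golden_answers(accuracy, n, rng)
        if passes_test(outcomes, threshold):
            passed += 1
    return passed / trials


if __name__ == "__main__":
    # Compare a careful annotator with a careless one on n = 50 golden questions.
    for acc in (0.95, 0.70):
        print(f"accuracy={acc:.2f} pass rate={pass_rate(acc, 50, 0.85):.3f}")
```

Under these assumptions, a careful annotator passes the test far more often than a careless one, which is what makes a bonus tied to the test outcome an incentive to exert effort. High-certainty golden questions matter here because the comparison against the golden label is only informative when that label is reliably correct.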
