Incentivizing High-Quality Human Annotations with Golden Questions

Shang Liu, Zhongze Cai, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li

TL;DR

This work studies how to incentivize high-quality human annotations for LLM alignment by modeling annotator behavior in a principal–agent framework. It introduces golden questions and an MLE-based monitoring test to infer an unobserved commitment parameter $\theta_a$, deriving a mini-max rate $\mathrm{Var}(\Psi)=\Theta(1/\sqrt{n \log n})$ that reflects the strategic nature of annotators. The authors propose two criteria for selecting effective golden questions (high certainty and format similarity to regular items) and validate them via real data collection and reward-model–driven experiments, showing that real golden questions outperform instruction-based checks in distinguishing high-quality annotators. The results provide a principled basis for designing incentive schemes and monitoring data quality in large-scale annotation pipelines, with implications for reward modeling and post-training alignment workflows.

Abstract

Human-annotated data plays a vital role in training large language models (LLMs), for example in supervised fine-tuning and human preference alignment. However, it is not guaranteed that paid human annotators produce high-quality data. In this paper, we study how to incentivize human annotators to do so. We start from a principal-agent model of the dynamics between the company (the principal) and the annotator (the agent), where the principal can only monitor the annotation quality by examining $n$ samples. We investigate the maximum likelihood estimators (MLE) and the corresponding hypothesis testing to incentivize annotators: the agent is given a bonus if the MLE passes the test. By analyzing the variance of the outcome, we show that the strategic behavior of the agent makes the hypothesis testing very different from traditional ones: unlike the exponential rate proved by large deviation theory, the principal-agent model's hypothesis testing rate is $\Theta(1/\sqrt{n \log n})$. Our theory implies two criteria for the golden questions used to monitor the performance of the annotators: they should be of (1) high certainty and (2) similar format to normal ones. In that light, we select a set of golden questions in human preference data. Through incentive-compatible experiments, we find that the annotators' behavior is better revealed by those golden questions than by traditional survey techniques such as instructed manipulation checks.
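The bonus rule described in the abstract can be illustrated with a deliberately simplified sketch. It assumes each golden question is answered correctly with an unknown, fixed probability (a Bernoulli model), so the MLE is the fraction of correct answers and the bonus is paid when that estimate clears a threshold. The function names, the Bernoulli model, and the `threshold` value are illustrative assumptions, not the paper's estimator or test.

```python
from typing import Sequence


def mle_accuracy(responses: Sequence[int], golden: Sequence[int]) -> float:
    """MLE of a Bernoulli accuracy: the fraction of golden questions answered correctly.

    Assumption: answers are i.i.d. correct with a fixed probability, which is
    a simplification of the principal-agent model in the paper.
    """
    if len(responses) != len(golden) or len(golden) == 0:
        raise ValueError("responses and golden must be non-empty and of equal length")
    correct = sum(r == g for r, g in zip(responses, golden))
    return correct / len(golden)


def passes_test(responses: Sequence[int], golden: Sequence[int],
                threshold: float = 0.8) -> bool:
    """Pay the bonus when the estimated accuracy reaches the (hypothetical) threshold."""
    return mle_accuracy(responses, golden) >= threshold


if __name__ == "__main__":
    golden_answers = [1, 0, 0, 1]       # known answers on n = 4 golden questions
    annotator_answers = [1, 0, 1, 1]    # one mistake on the third question
    print(mle_accuracy(annotator_answers, golden_answers))
    print(passes_test(annotator_answers, golden_answers))
```

In this toy version the annotator is not strategic; the paper's point is that once the agent chooses effort in response to the test, the test's behavior changes.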

Paper Structure

This paper contains 25 sections, 3 theorems, 58 equations, 5 figures, 3 algorithms.

Key Result

Theorem 3.4

Under Assumptions assum:regular_PA_model and assum:regulating_likelihood, there exists some $F_n$ such that $\theta_a(F_n) = \theta^\ast$ with $\mathrm{Var}(\Psi) = O(1/\sqrt{n \log n})$. Furthermore, such a rate is mini-max optimal: we can construct an example satisfying Assumptions assum:regular_PA_model and assum:regulating_likelihood such that for any $F_n$ with $\theta_a(F_n) = \theta^\ast$, $\mathrm{Var}(\Psi) = \Omega(1/\sqrt{n \log n})$.
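To get a feel for how much slower the $1/\sqrt{n \log n}$ rate from the abstract is than an exponential rate, the following sketch evaluates both at a few sample sizes. The constant `c` in the exponential rate and the use of the natural logarithm are assumptions made only for illustration; the paper's constants are not given here.

```python
import math


def strategic_rate(n: int) -> float:
    """Evaluate 1 / sqrt(n log n), using the natural log (an assumption)."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log n > 0")
    return 1.0 / math.sqrt(n * math.log(n))


def exponential_rate(n: int, c: float = 0.1) -> float:
    """Evaluate exp(-c n) for a hypothetical constant c > 0."""
    return math.exp(-c * n)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, strategic_rate(n), exponential_rate(n))
```

For large $n$ the exponential term becomes negligible while the polynomial-type rate shrinks only slowly, which is the qualitative contrast the abstract draws.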

Figures (5)

  • Figure 1: Accuracy of Skywork-Reward-Gemma-2-27B-v0.2 on six human preference datasets in predicting the human preference, evaluated on the top 10% (most confident), top 50% (moderately confident), and all examples. Higher-certainty subsets of samples yield substantially higher accuracy.
  • Figure 2: Annotator behavior across different types of golden questions: instructed vs. real golden (Algorithm \ref{alg:high_certainty_selection}). Both types have certain answers, but the real golden questions are harder to identify. (a) Mean annotation accuracy across annotators with correct and incorrect responses to golden questions. (b) Difference in annotation accuracy between correct and incorrect response groups for each type. The results are based on 90 human annotators. More details can be found in Appendix \ref{subapx:field_exp}.
  • Figure 3: Accuracy of URM-LLaMa-3-8B and GRM-Llama3.2-3B on six human preference datasets.
  • Figure 4: Annotation accuracy distribution across different types of golden questions and annotator groups. Histograms show the accuracy on non-golden questions for annotators grouped by whether they correctly answered all instructed golden questions (top row) or all real golden questions (bottom row). (a) and (c) represent annotators who passed the golden questions ("Correct"), while (b) and (d) represent those who failed at least one ("Incorrect"). The red dashed line indicates the mean accuracy within each group. Group sizes and mean accuracies are annotated in each figure.
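The "top 10% (most confident)" subsets in Figure 1 suggest one way to pick high-certainty candidates for golden questions: rank preference pairs by how strongly a reward model separates the two responses. The sketch below assumes certainty is measured by the absolute score margin and that scores are already computed; the function name, the margin criterion, and the `top_fraction` parameter are hypothetical and are not taken from the paper's selection algorithm.

```python
import math
from typing import List, Sequence, Tuple


def select_high_certainty(score_pairs: Sequence[Tuple[float, float]],
                          top_fraction: float = 0.1) -> List[int]:
    """Return indices of the pairs with the largest reward-score margin.

    score_pairs[i] holds the (assumed precomputed) reward-model scores of the
    two candidate responses for item i.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    # Certainty proxy (an assumption): absolute difference between the two scores.
    margins = [abs(a - b) for a, b in score_pairs]
    # Keep at least one item when the input is non-empty.
    k = max(1, math.ceil(top_fraction * len(margins)))
    ranked = sorted(range(len(margins)), key=lambda i: margins[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    pairs = [(2.0, -1.0), (0.1, 0.0), (1.0, 3.5), (0.5, 0.4)]
    print(select_high_certainty(pairs, top_fraction=0.5))
```

Selecting by margin only addresses the first criterion (high certainty); the second criterion, similar format to normal questions, holds here only insofar as candidates are drawn from the same preference data as the regular items.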

Theorems & Definitions (7)

  • Remark 3.2
  • Theorem 3.4
  • Proposition B.1
  • Proof of Proposition \ref{prop:second_converge_to_first}
  • Proof of the upper bound part
  • Proof of the lower bound part
  • Proposition B.2