How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

Shang Liu, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li

TL;DR

This work addresses how to assess and incentivize human annotators for language preference data used in LLM alignment, focusing on two challenges: annotator heterogeneity and the unclear link between annotation quality and downstream performance. It develops a probabilistic annotator model and a principal-agent framework with a continuous action space, and studies two assessment methods (self-consistency monitoring and expert-based monitoring) and two contract forms (binary and linear). Theoretical results show that the gap between the first-best and second-best solutions is $\Theta(1/\sqrt{n\log n})$ for binary contracts and $\Theta(1/n)$ for linear contracts, where $n$ is the number of samples used for assessment, and that self-consistency monitoring outperforms expert-based monitoring under broad conditions. Empirical analysis on real preference datasets supports the theoretical claims and demonstrates practical advantages for self-consistency monitoring and linear incentive schemes in improving data quality for RLHF/DPO-style alignment tasks.

Abstract

Human-annotated preference data play an important role in aligning large language models (LLMs). In this paper, we investigate how to assess the performance of human annotators and how to incentivize them to provide high-quality annotations. The quality assessment of language/text annotation faces two challenges: (i) the intrinsic heterogeneity among annotators, which rules out classic methods that assume the existence of an underlying true label; and (ii) the unclear relationship between annotation quality and the performance of downstream tasks, which excludes the possibility of inferring the annotators' behavior from the performance of the model trained on the annotation data. We then formulate a principal-agent model to characterize the behaviors of, and the interactions between, the company and the human annotators. The model rationalizes a practical bonus-scheme mechanism for incentivizing annotators that benefits both parties, and it underscores the importance of the joint presence of an assessment system and a proper contract scheme. From a technical perspective, our analysis extends the existing literature on the principal-agent model by considering a continuous action space for the agent. We show that the gap between the first-best and the second-best solutions (under the continuous action space) is $\Theta(1/\sqrt{n \log n})$ for the binary contracts and $\Theta(1/n)$ for the linear contracts, where $n$ is the number of samples used for performance assessment; this contrasts with the known result of $\exp(-\Theta(n))$ for the binary contracts when the action space is discrete. Throughout the paper, we use real preference annotation data to accompany our discussions.
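To make the three rates in the abstract concrete, the following minimal sketch evaluates them at a few sample sizes. It assumes every hidden constant in the $\Theta(\cdot)$ notation equals 1, which the source does not state, so only the relative ordering for large $n$ is meaningful; the function names are illustrative and not from the paper.

```python
import math

# Assumption: all hidden constants in the Theta(.) rates are set to 1.
# The functions below are illustrative names, not the paper's notation.

def binary_gap_rate(n: int) -> float:
    """Rate 1/sqrt(n log n): binary contract, continuous action space."""
    return 1.0 / math.sqrt(n * math.log(n))

def linear_gap_rate(n: int) -> float:
    """Rate 1/n: linear contract, continuous action space."""
    return 1.0 / n

def discrete_binary_gap_rate(n: int) -> float:
    """Rate exp(-n): binary contract, discrete action space (known result)."""
    return math.exp(-n)

if __name__ == "__main__":
    # n must be at least 2 so that log(n) > 0.
    for n in (10, 100, 1000, 10000):
        print(
            f"n={n:>6}  binary={binary_gap_rate(n):.3e}  "
            f"linear={linear_gap_rate(n):.3e}  "
            f"discrete-binary={discrete_binary_gap_rate(n):.3e}"
        )
```

Under this unit-constant assumption, the linear-contract rate falls below the binary-contract rate at every sample size shown, while the discrete-action rate decays far faster than either.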

Paper Structure

This paper contains 35 sections, 14 theorems, 133 equations, 9 figures, 4 algorithms.

Key Result

Proposition 3.1

The following inequality holds for any $0 \leq \eta_0 < \eta_1 \leq 1,$ where the infimum over $\Psi$ is taken with respect to any measurable function and the probability $\mathbb{P}(\cdot)$ on the left hand side is with respect to the law of \eqref{eqn:Z_annotation_quality} and \eqref{eqn:annotation_eta}. Here $\mathcal{P}_{\eta_0}$ and $\mathcal{P}_{\eta_1}$ on the right hand side re

Figures (9)

  • Figure 1: How expert-based monitoring fails on real preference data. Upper four plots: histograms of $\mathbb{P}(y_{\text{chosen}} \succ y_{\text{rejected}} \mid x)$ ($y_{\text{chosen}}$ and $y_{\text{rejected}}$ represent the chosen/preferred and rejected responses, respectively). Lower four plots: the lower bound of the sum of two types of errors against the number of tested annotations $n$ at different $\eta_0$ with $\eta_1=1$ (see Proposition \ref{prop:info_lower_bound}). The observations align with Proposition \ref{prop:info_lower_bound}: the lower bound (i) decreases monotonically with $n$ and increases with $\eta_0$, and (ii) depends on the underlying distribution of preference probabilities. Note that the PKU dataset, where preference probabilities are mostly around 1/2, faces higher errors in assessing annotation quality than datasets (e.g., Skywork) where preference probabilities deviate further from 1/2. See Appendix \ref{appx:fig_hist_LB} for the setup and additional results with $\eta_1<1$.
  • Figure 2: Comparison between self-consistency monitoring (upper bound) and expert-based monitoring (lower bound). For the sum of two types of errors, we plot the upper bound for self-consistency monitoring with various values of $\delta$ (blue, thick line) and the lower bound for expert-based monitoring (red, dashed line), evaluated at $\eta_0 \in \{0.8, 0.9\}$ and $\eta_1 = 1$ for two datasets. Even with a nontrivial disagreement probability $\delta$, self-consistency monitoring outperforms expert-based monitoring over a wide range of $n$, especially when the average preference probability is near $1/2$ (e.g., PKU). See Appendix \ref{appx:fig_self_UB} for details on the experimental setup and additional results with $\eta_1 < 1$.
  • Figure 3: Normalized principal utility gap ($\mathcal{C}-\mathcal{C}_n$ and $\mathcal{C}-\tilde{\mathcal{C}}_n$) under different monitoring and contract settings. In these experiments, we set $U_0=0$, $\delta=0.02$, $\mu(\eta)=1/2\eta^{4/5}$, $G_a(w_a)=1-\exp(-w_a)$, and $E(\eta)=0.18\eta^2$ (see Appendix \ref{appx:fig_contract_rank} for further details and additional configurations). (i) The self-consistency monitoring consistently outperforms the expert-based monitoring given the same second-best formulation and contract type. (ii) The performance of the expert-based monitoring depends on the underlying distribution of preference probabilities and may perform poorly in some cases (e.g., PKU). (iii) The numerical results validate Theorems \ref{thm:binary} and \ref{thm:linear_contract}: the linear contract closes the gap at a faster rate than the binary contract in $n$. For instance, in PKU under $\tilde{\mathcal{C}}_n$ with expert-based monitoring (red square line), the binary contract initially exhibits a lower utility gap than the linear contract, but when $n\geq 100$, the linear contract achieves a lower utility gap.
  • Figure 4: Illustration for Lemma \ref{lemma:binomial_properties}.
  • Figure 5: Calibration for two datasets. (Top row) Empirical preference probability $p(x,y_1,y_2)$ vs. the predicted probability before and after calibration. The dashed line ($x=y$) represents perfect alignment between predictions and empirical observations. (Bottom row) Histogram of the (predicted) preference probability $p(x,y_1,y_2)$ before and after calibration. We can see the calibration procedure improves alignment between the predicted probabilities and the empirical observations for both datasets.
  • ...and 4 more figures
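
The Figure 1 caption notes that expert-based monitoring struggles when preference probabilities sit near 1/2. The toy calculation below sketches one intuition for this. It rests on an assumed annotator model that is not taken from the paper: with probability `eta` the annotator draws a label from the true preference probability `p`, otherwise the annotator flips a fair coin, and the expert label is an independent draw from `p`. The function names are hypothetical.

```python
# Toy model (an assumption, not the paper's annotator model):
#   - expert label    ~ Bernoulli(p)
#   - annotator label ~ Bernoulli(p) with probability eta (careful),
#                       Bernoulli(1/2) otherwise (random guess)
# The two labels are drawn independently given p.

def agreement_prob(p: float, eta: float) -> float:
    """Probability that the annotator's label matches the expert's label."""
    same_if_careful = p * p + (1.0 - p) * (1.0 - p)  # both 1 or both 0
    same_if_guessing = 0.5                           # a coin flip matches half the time
    return eta * same_if_careful + (1.0 - eta) * same_if_guessing

def separation(p: float, eta0: float, eta1: float = 1.0) -> float:
    """Gap in agreement probability between quality levels eta1 and eta0."""
    return abs(agreement_prob(p, eta1) - agreement_prob(p, eta0))

if __name__ == "__main__":
    for p in (0.5, 0.6, 0.75, 0.9):
        print(f"p={p:.2f}  separation(eta0=0.8, eta1=1)={separation(p, 0.8):.4f}")
```

In this toy model the agreement rate equals 1/2 for every `eta` when `p = 1/2`, so agreement with an expert carries no signal about annotator quality there, and the signal grows as `p` moves away from 1/2. This is only an illustration of the qualitative trend described in the caption, not a reproduction of the paper's lower bound.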

Theorems & Definitions (25)

  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • Proposition 3.4
  • Proposition 4.4
  • Theorem 4.6
  • Theorem 4.7
  • Definition A.1
  • Lemma A.2: Le Cam's Lemma [le2012asymptotic]
  • Lemma A.3: Bretagnolle-Huber's Inequality [bretagnolle1978estimation]
  • ...and 15 more