Algorithmically Establishing Trust in Evaluators

Adrian de Wynter

TL;DR

This work introduces the No-Data Algorithm (NDA) to certify an evaluator's trustworthiness without any labeled data, by coupling an Evaluator-Verifier (EV) protocol inspired by zero-knowledge proofs with a label-flipping mechanism controlled by a tunable parameter φ. The authors prove that after r EV rounds, a knowledgeable evaluator is accepted with probability at least $1 - (1/4)^r$ and deceptive evaluators are detected with high probability, while keeping runtime linear in the dataset size. Empirically, NDA is validated on synthetic binary-data tasks with both traditional learners and LLM-based evaluators, and applied to a low-resource language (West Frisian) labelling task, demonstrating robustness to rubric ambiguity and model competency. The paper also discusses the necessity of well-designed rubrics, the extension to k-ary labels, and the practical limitations and ethical considerations of deploying such a trust mechanism in scarce-label domains. Overall, NDA offers a mathematically grounded approach to trustworthy evaluation in settings where traditional references are unavailable or unreliable.

Abstract

An evaluator, such as an LLM-as-a-judge, is trustworthy when there exists some agreed-upon way to measure its performance as a labeller. Traditional approaches either rely on testing the evaluator against references or assume that it somehow 'knows' the correct labelling. Both approaches fail when references are unavailable: the former requires data, and the latter is an assumption, not evidence. To address this, we introduce the 'No-Data Algorithm', which provably establishes trust in an evaluator without requiring any labelled data. Our algorithm works by successively posing challenges to said evaluator. We prove that after $r$ challenge rounds, it accepts an evaluator which knows the correct labels with probability $\geq 1 - (1/4)^r$, and reliably flags untrustworthy ones. We present formal proofs of correctness, empirical tests, and applications to assessing trust in LLMs-as-judges for low-resource language labelling. Our work enables scientifically-grounded evaluator trust in low-data domains, addressing a critical bottleneck for scalable, trustworthy LLM deployment.

Paper Structure

This paper contains 46 sections, 5 theorems, 6 equations, 1 figure, 16 tables, and 1 algorithm.

Key Result

Lemma 5.1

The probability that the verifier fails to detect a lie by the evaluator in the EV protocol sub-game after $r$ rounds is $(1/4)^r$.
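
To get a feel for how quickly the bound in Lemma 5.1 shrinks, the following minimal sketch evaluates $(1/4)^r$ and finds the smallest number of rounds that pushes it below a chosen tolerance. The function names and the tolerance parameter `delta` are illustrative assumptions and do not appear in the paper.

```python
def undetected_lie_bound(r: int) -> float:
    """Value of (1/4)^r, the Lemma 5.1 probability of an undetected lie after r rounds."""
    if r < 0:
        raise ValueError("r must be non-negative")
    return 0.25 ** r


def rounds_needed(delta: float) -> int:
    """Smallest r such that (1/4)^r <= delta.

    `delta` is a hypothetical user-chosen tolerance, not a quantity from the paper.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    r = 0
    # Powers of 0.25 are exact in binary floating point, so a plain loop is safe.
    while 0.25 ** r > delta:
        r += 1
    return r


if __name__ == "__main__":
    for r in (1, 2, 5, 10):
        print(r, undetected_lie_bound(r))
    print("rounds for delta=0.01:", rounds_needed(0.01))
```

Under this reading, each extra round multiplies the bound by 1/4, so a handful of rounds already makes an undetected lie unlikely.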

Figures (1)

  • Figure 1: EV protocol flow. At every round, the evaluator (blue) generates an $x'$ similar to $x$, and a partial label $\tilde{y}'$. It then answers one of two challenges posed by the verifier (orange), chosen uniformly at random. If the evaluator does not pass the challenge, the protocol returns failure. Otherwise, the game is repeated. Once all rounds have been played, the protocol returns success. In either case, it also returns $\tilde{y}'$.
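
The control flow described in the Figure 1 caption can be sketched as a short loop. This is only a skeleton under stated assumptions: the caption does not say what the two challenges are or how answers are checked, so `propose`, `answer`, and `check` are hypothetical callables supplied by the caller, and returning the most recent partial label is an assumed reading of "it also returns $\tilde{y}'$".

```python
import random
from typing import Any, Callable, Optional, Tuple


def ev_protocol(
    x: Any,
    rounds: int,
    propose: Callable[[Any], Tuple[Any, Any]],          # hypothetical: x -> (x_prime, partial_label)
    answer: Callable[[int, Any, Any, Any], Any],        # hypothetical: evaluator's reply to a challenge
    check: Callable[[int, Any, Any, Any, Any], bool],   # hypothetical: verifier's test of that reply
    rng: Optional[random.Random] = None,
) -> Tuple[bool, Any]:
    """Run the round structure from the Figure 1 caption.

    Returns (succeeded, partial_label). The partial label is the one from the
    last round played (an assumption), or None if rounds == 0.
    """
    rng = rng or random.Random()
    partial_label = None
    for _ in range(rounds):
        # Evaluator generates x' similar to x, plus a partial label.
        x_prime, partial_label = propose(x)
        # Verifier picks one of two challenges uniformly at random.
        challenge = rng.choice((0, 1))
        response = answer(challenge, x, x_prime, partial_label)
        # A failed challenge ends the protocol immediately with failure.
        if not check(challenge, x, x_prime, partial_label, response):
            return False, partial_label
    # All rounds passed.
    return True, partial_label
```

Keeping the challenge logic behind callables leaves the sketch agnostic about the actual challenge contents, which are defined in the paper body rather than in this summary.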

Theorems & Definitions (12)

  • Remark 4.1
  • Lemma 5.1: EV Protocol Correctness Bound
  • proof
  • Theorem 5.2: No-Data Algorithm Correctness Bound
  • proof
  • Lemma 1.1
  • proof
  • Lemma 1.2
  • proof
  • Lemma 2.1
  • ...and 2 more