Algorithmically Establishing Trust in Evaluators

Adrian de Wynter

TL;DR

This work introduces the No-Data Algorithm (NDA) to certify an evaluator's trustworthiness without any labeled data, by coupling an Evaluator-Verifier (EV) protocol inspired by zero-knowledge proofs with a label-flipping mechanism controlled by a tunable parameter φ. The authors prove that after r EV rounds, a knowledgeable evaluator is accepted with probability at least $1 - (1/4)^r$ and deceptive evaluators are detected with high probability, while keeping runtime linear in the dataset size. Empirically, NDA is validated on synthetic binary-data tasks with both traditional learners and LLM-based evaluators, and applied to a low-resource language (West Frisian) labelling task, demonstrating robustness to rubric ambiguity and model competency. The paper also discusses the necessity of well-designed rubrics, the extension to k-ary labels, and the practical limitations and ethical considerations of deploying such a trust mechanism in scarce-label domains. Overall, NDA offers a mathematically grounded approach to trustworthy evaluation in settings where traditional references are unavailable or unreliable.

Abstract

An evaluator, such as an LLM-as-a-judge, is trustworthy when there exists some agreed-upon way to measure its performance as a labeller. Traditional approaches either rely on testing the evaluator against references or assume that it somehow 'knows' the correct labelling. Both approaches fail when references are unavailable: the former requires data, and the latter is an assumption, not evidence. To address this, we introduce the 'No-Data Algorithm', which provably establishes trust in an evaluator without requiring any labelled data. Our algorithm works by successively posing challenges to the evaluator. We prove that after $r$ challenge rounds, it accepts an evaluator that knows the correct labels with probability $\geq 1 - (1/4)^r$, and reliably flags untrustworthy ones. We present formal proofs of correctness, empirical tests, and applications to assessing trust in LLMs-as-judges for low-resource language labelling. Our work enables scientifically grounded evaluator trust in low-data domains, addressing a critical bottleneck for scalable, trustworthy LLM deployment.
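
To give a feel for how quickly the stated acceptance bound $1 - (1/4)^r$ tightens with the number of challenge rounds $r$, the following minimal sketch evaluates it numerically. It only computes the lower bound quoted above and does not implement the No-Data Algorithm itself; the function names `acceptance_lower_bound` and `rounds_for_target` are hypothetical and chosen for illustration only.

```python
def acceptance_lower_bound(r: int) -> float:
    """Lower bound 1 - (1/4)^r on the probability of accepting an
    evaluator that knows the correct labels, after r challenge rounds."""
    if r < 0:
        raise ValueError("r must be non-negative")
    return 1.0 - 0.25 ** r


def rounds_for_target(target: float) -> int:
    """Smallest r such that 1 - (1/4)^r >= target, for 0 <= target < 1.

    Illustrative helper (not from the paper): it inverts the bound by
    simple iteration, which is enough because the bound grows quickly.
    """
    if not 0.0 <= target < 1.0:
        raise ValueError("target must lie in [0, 1)")
    r = 0
    while acceptance_lower_bound(r) < target:
        r += 1
    return r


if __name__ == "__main__":
    # Print the bound for the first few round counts.
    for r in range(1, 6):
        print(f"r = {r}: bound = {acceptance_lower_bound(r):.6f}")
    # Rounds needed before the bound reaches 99%.
    print("rounds for 0.99:", rounds_for_target(0.99))
```

Under this bound, a handful of rounds already suffices: one round gives 0.75, two give 0.9375, and four rounds push the bound above 0.99.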
