Oracle-Checker Scheme for Evaluating a Generative Large Language Model

Yueling Jenny Zeng; Li-C. Wang; Thomas Ibbetson

TL;DR

This work introduces an oracle-checker scheme that treats a generative LLM as an oracle whose outputs are validated by domain-specific checkers. It combines three strategies (property-based validation, proof-based validation, and trust-oriented validation) across two tasks: entity extraction and paraphrase decision. The entity-extraction checker relies on a linearity test that treats extraction as a homomorphism between structured entity groups, while the paraphrase checker uses alignment-based proofs to verify semantic equivalence via $\rho$- and $\phi$-alignments. Experiments on DOCRED, RISC-V, and MSRP with GPT-3.5 as the oracle demonstrate that the approach can selectively accept trustworthy yes-answers and reject dubious no-answers, revealing insights into the trust and definitional challenges inherent in LLM outputs. The framework offers a principled, if not yet optimal, path to articulating subjective task definitions and assessing LLM trustworthiness in real-world, domain-specific settings.

Abstract

This work presents a novel approach, called the oracle-checker scheme, for evaluating the answers given by a generative large language model (LLM). Two types of checkers are presented: the first follows the idea of property testing, and the second follows the idea of program checking. Their applications are demonstrated in two separate contexts, entity extraction and paraphrase decision, respectively.

Paper Structure (33 sections, 2 theorems, 10 figures, 15 tables, 8 algorithms)

Introduction
The oracle-checker scheme
Theoretical background
A linearity test for entity extraction
$G_1,G_2$ in entity extraction
Key points in the linearity test
Program checking for paraphrase decision
From graph isomorphism to semantic equivalence
Key points in our alignment-based approximation
Finding $\rho$-alignments and $\phi$-alignments
Experiments
Results on entity extraction
Results on paraphrase decision
Sanity checks
Results from the proof perspective
...and 18 more sections

Key Result

Theorem 3.1

Each linearity test provides its own assurance: If $f$ is a homomorphism from $G_1$ to $G_2$, the test passes with probability 1. If $f$ is $\epsilon$-far from ${\cal F}$, the set of all homomorphisms from $G_1$ to $G_2$, then the test fails with probability at least $3\epsilon - 6\epsilon^2$.
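The check behind a single linearity test can be illustrated with a minimal sketch. It assumes, for simplicity, that both groups are bit vectors under XOR (the paper's range group $G_2$ is built from synonym sets with a different operator), and the names `linearity_test`, `project`, and `complement` are hypothetical and not taken from the paper. Each trial samples $x, y$ and checks whether $f(x) \oplus f(y) = f(x \oplus y)$.

```python
import random


def xor(a, b):
    """Group operation assumed here: bitwise XOR of equal-length bit tuples."""
    return tuple(x ^ y for x, y in zip(a, b))


def linearity_test(f, n_bits, trials=100, seed=0):
    """Return True if f satisfies f(x) + f(y) == f(x + y) on every sampled pair."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = tuple(rng.randint(0, 1) for _ in range(n_bits))
        y = tuple(rng.randint(0, 1) for _ in range(n_bits))
        if xor(f(x), f(y)) != f(xor(x, y)):
            return False  # one violated pair is enough to reject
    return True


def project(v):
    """A homomorphism: keep the bits at even positions."""
    return v[::2]


def complement(v):
    """Not a homomorphism: flipping every bit breaks f(x) + f(y) == f(x + y)."""
    return tuple(1 - bit for bit in v)


print(linearity_test(project, n_bits=8))     # homomorphism: never rejected
print(linearity_test(complement, n_bits=8))  # violates the identity on every pair
```

A true homomorphism such as `project` is never rejected, matching the first half of the theorem. A function far from every homomorphism is rejected with a probability that grows with the number of trials, which is why the test is repeated.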

Figures (10)

Figure 1: For an LLM $L$, a checker $C_L$ can follow one of three strategies. For entity extraction, $C_L$ implements the property strategy. For paraphrase decision, $C_L$ implements the proof strategy if the answer is yes, and the trust strategy if the answer is no.
Figure 2: For the linearity test in entity extraction, the domain group $G_1$ is defined based on a set of binary vectors and the XOR ($\oplus$) operator. The range group $G_2$ is defined based on a set of synonymous entities (generated by the synonym generator $syn$) and the XNOR operator $\ominus$ on its subsets, also represented as vectors.
Figure 3: Mapping from graph isomorphism to semantic equivalence by replacing the permutation $\pi$ with an alignment $\rho$ as the proof
Figure 4: The scheme to approximate the non-isomorphism test
Figure 5: Given two syntactic trees $U,V$ as source and target, respectively, a $\rho$-alignment is found between the two (in the example, each phrase mapping, between a pair of nodes and its corresponding pair of phrases, is shown in the same color). Searching for a $\rho$-alignment is based on matching a node $\mu$ in $U$ with its most similar node $\nu$ in $V$. The similarity from a node in $U$ to a node in $V$ is determined by the BERT model (Devlin et al., 2019). Each phrase mapping is a sub-tree containing a matching path of depth at least two from the sub-tree's root.
...and 5 more figures
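
The node-matching step described in the Figure 5 caption (pair each node $\mu$ in $U$ with its most similar node $\nu$ in $V$) can be sketched roughly as follows. This is only an illustration under stated assumptions: nodes are represented by their phrases as plain strings, and a token-overlap (Jaccard) score stands in for the BERT-based similarity used in the paper. The names `jaccard` and `match_nodes` are hypothetical, and the sketch does not build the sub-tree phrase mappings.

```python
def jaccard(a, b):
    """Stand-in similarity (assumption): token overlap instead of BERT."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def match_nodes(source_phrases, target_phrases, similarity=jaccard, threshold=0.0):
    """Map each source phrase to its most similar target phrase.

    A pair is kept only if its similarity is strictly above the threshold.
    """
    matches = {}
    if not target_phrases:
        return matches
    for mu in source_phrases:
        best = max(target_phrases, key=lambda nu: similarity(mu, nu))
        if similarity(mu, best) > threshold:
            matches[mu] = best
    return matches


source = ["the company", "bought the startup"]
target = ["the startup was acquired", "by the company"]
for mu, nu in match_nodes(source, target).items():
    print(f"{mu!r} -> {nu!r}")
```

Swapping `similarity` for an embedding-based score changes only that one argument, which keeps the matching loop independent of the particular model.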

Theorems & Definitions (2)

Theorem 3.1
Theorem 3.2
