Oracle-Checker Scheme for Evaluating a Generative Large Language Model
Yueling Jenny Zeng, Li-C. Wang, Thomas Ibbetson
TL;DR
This work introduces an oracle-checker scheme that treats a generative LLM as an oracle whose outputs are validated by domain-specific checkers. It combines three strategies (property-based validation, proof-based validation, and trust-oriented validation) across two tasks: entity extraction and paraphrase decision. The entity-extraction checker relies on a linearity test that treats extraction as a homomorphism between structured entity groups, while the paraphrase checker uses alignment-based proofs to verify semantic equivalence via rho- and phi-alignments. Experiments on DocRED, RISC-V, and MSRP with GPT-3.5 as the oracle demonstrate that the approach can selectively accept trustworthy yes-answers and reject dubious no-answers, revealing insights into the trust and definitional challenges inherent in LLM outputs. This framework offers a principled, if not yet optimal, path to articulating subjective task definitions and assessing LLM trustworthiness in real-world, domain-specific settings.
Abstract
This work presents a novel approach, called the oracle-checker scheme, for evaluating the answers given by a generative large language model (LLM). Two types of checkers are presented: the first follows the idea of property testing, and the second follows the idea of program checking. Their applications are demonstrated in two separate contexts, entity extraction and paraphrase decision, respectively.
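To make the property-testing idea concrete, the following is a minimal sketch of a linearity-style check for an entity-extraction oracle. It rests on assumptions that are not taken from the paper: texts are combined by string concatenation, entity sets are combined by set union, and the property tested is that extracting from the combined text gives the union of the separate extractions. The names `linearity_check`, `toy_oracle`, and `lossy_oracle` are hypothetical, and the two toy extractors stand in for an LLM call.

```python
import random
from typing import Callable, FrozenSet, List

# An "oracle" here is any function from text to a set of entity strings.
# In practice this would wrap an LLM call; that wrapper is not shown.
Extractor = Callable[[str], FrozenSet[str]]


def linearity_check(oracle: Extractor, passages: List[str],
                    trials: int = 20, seed: int = 0) -> float:
    """Return the fraction of sampled pairs (x, y) for which
    oracle(x + y) == oracle(x) | oracle(y).

    Assumption: concatenation on texts and union on entity sets are the
    group-like operations; the paper's actual construction may differ.
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        x, y = rng.sample(passages, 2)        # two distinct passages
        combined = oracle(x + " " + y)        # extract from the joined text
        separate = oracle(x) | oracle(y)      # join the separate extractions
        if combined == separate:
            passed += 1
    return passed / trials


# Hypothetical vocabulary for the toy extractors below.
KNOWN = {"RISC-V", "GPT-3.5", "DocRED"}


def toy_oracle(text: str) -> FrozenSet[str]:
    """Dictionary lookup per token; linear by construction."""
    tokens = (tok.strip(".,") for tok in text.split())
    return frozenset(tok for tok in tokens if tok in KNOWN)


def lossy_oracle(text: str) -> FrozenSet[str]:
    """Keeps at most one entity, so it drops entities on joined text."""
    return frozenset(sorted(toy_oracle(text))[:1])


passages = [
    "The RISC-V core boots.",
    "GPT-3.5 answered the query.",
    "DocRED contains annotated documents.",
]

consistent_rate = linearity_check(toy_oracle, passages)
lossy_rate = linearity_check(lossy_oracle, passages)
```

A pass rate close to 1 suggests the oracle behaves consistently under composition on the sampled inputs; it does not by itself show that the extracted entities are correct, which is why the check is probabilistic evidence of trust rather than a proof.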
