Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks
Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava
TL;DR
AutoEval introduces an autonomous benchmark for truth maintenance in formal language tasks by auto-generating ground-truth FL via context-free grammars, translating between NL and FL with non-deterministic informalisations and autoformalisations, and verifying semantic equivalence with formal provers. It defines the central metric, the forall-ut oexists-? L score, including parameterized and calibrated variants, and uses a formal verifier to guard against brittle syntactic checks. Empirical results show SOTA LLMs struggle to maintain truth as task complexity grows, while calibrated AutoEval scores strongly predict performance on other FL benchmarks, establishing AutoEval as a scalable surrogate for evaluating truth-maintenance capabilities. The framework is extensible to additional formal syntaxes and can support autonomous evaluation pipelines for future LLMs, albeit with limitations such as FO logic undecidability and timeout concerns. Overall, AutoEval provides a principled, automated, and scalable approach to evaluating LLM truth maintenance with practical implications for reliable NL→FL reasoning and formal translation tasks.
Abstract
This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM's performance on AutoEval is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update.
