Table of Contents
Fetching ...

Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

Rushang Karia, Daniel Bramblett, Daksh Dobhal, Siddharth Srivastava

TL;DR

AutoEval introduces an autonomous benchmark for truth maintenance in formal language tasks by auto-generating ground-truth FL via context-free grammars, translating between NL and FL with non-deterministic informalisations and autoformalisations, and verifying semantic equivalence with formal provers. It defines the central metric, the forall-ut oexists-? L score, including parameterized and calibrated variants, and uses a formal verifier to guard against brittle syntactic checks. Empirical results show SOTA LLMs struggle to maintain truth as task complexity grows, while calibrated AutoEval scores strongly predict performance on other FL benchmarks, establishing AutoEval as a scalable surrogate for evaluating truth-maintenance capabilities. The framework is extensible to additional formal syntaxes and can support autonomous evaluation pipelines for future LLMs, albeit with limitations such as FO logic undecidability and timeout concerns. Overall, AutoEval provides a principled, automated, and scalable approach to evaluating LLM truth maintenance with practical implications for reliable NL→FL reasoning and formal translation tasks.

Abstract

This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM's performance on AutoEval is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update.

Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks

TL;DR

AutoEval introduces an autonomous benchmark for truth maintenance in formal language tasks by auto-generating ground-truth FL via context-free grammars, translating between NL and FL with non-deterministic informalisations and autoformalisations, and verifying semantic equivalence with formal provers. It defines the central metric, the forall-ut oexists-? L score, including parameterized and calibrated variants, and uses a formal verifier to guard against brittle syntactic checks. Empirical results show SOTA LLMs struggle to maintain truth as task complexity grows, while calibrated AutoEval scores strongly predict performance on other FL benchmarks, establishing AutoEval as a scalable surrogate for evaluating truth-maintenance capabilities. The framework is extensible to additional formal syntaxes and can support autonomous evaluation pipelines for future LLMs, albeit with limitations such as FO logic undecidability and timeout concerns. Overall, AutoEval provides a principled, automated, and scalable approach to evaluating LLM truth maintenance with practical implications for reliable NL→FL reasoning and formal translation tasks.

Abstract

This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness, such as truth maintenance in translation and logical reasoning. AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling: (a) ability to evaluate LLMs of increasing sophistication by auto-generating tasks at different levels of difficulty; (b) auto-generation of ground truth that eliminates dependence on expensive and time-consuming human annotation; (c) the use of automatically generated, randomized datasets that mitigate the ability of successive LLMs to overfit to static datasets used in many contemporary benchmarks. Empirical analysis shows that an LLM's performance on AutoEval is highly indicative of its performance on a diverse array of other benchmarks focusing on translation and reasoning tasks, making it a valuable autonomous evaluation paradigm in settings where hand-curated datasets can be hard to obtain and/or update.

Paper Structure

This paper contains 36 sections, 5 equations, 23 figures, 20 tables, 1 algorithm.

Figures (23)

  • Figure 1: The $\forall$uto$\exists$$\lor\!\land$L process for autonomous evaluation of LLM truth maintenance w.r.t. A∘I)^n(φ_0)$(\mathcal{A}\circ\mathcal{I})^{n}(\varphi_0)$.
  • Figure 2: CFGs (described in Sec. \ref{['sec:datasets']}) used for synthesizing the datasets in $\forall$uto$\exists$$\lor\!\land$L.
  • Figure 3: Zero-shot Pass@1 results from using $\forall$uto$\exists$$\lor\!\land$L to assess LLMs w.r.t. § A1, § A2, § A3 on the packaged datasets (Sec. \ref{['sec:datasets']}). The x-axis represents an increasing descriptional complexity. The y-axis is each evaluated LLM's syntactic compliance rate (1st row), parameterized $\forall$uto$\exists$$\lor\!\land$L score (2nd row), and $F_1$ score as a verifier (3rd row). Additional results (prompt calibration, few-shot, etc.) are included in the Appendix.
  • Figure 4: Correlation between scores on $\forall$uto$\exists$$\lor\!\land$L and static benchmarks from the literature. The Pearson correlation coefficient ($\rho$) and the $p$-value (values $\leq 0.05$ are statistically significant) are annotated in the top left. The calibrated $\forall$uto$\exists$$\lor\!\land$L score $S_{cal}(D,d)$ use all strings in dataset $D$ with descriptional complexity $d$ bounded above, as shown in the plots (App. \ref{['Appendix:corelation_splits']}). Grey hexagons () represent data from 10 other models.
  • Figure 5: Predictive power of $\forall$uto$\exists$$\lor\!\land$L w.r.t other benchmarks. Benchmark metrics appear after the colon.
  • ...and 18 more figures

Theorems & Definitions (4)

  • Definition 2.1: Autoformalization: $\mathcal{A}$
  • Definition 2.2: Informalization: $\mathcal{I}$
  • Definition 2.3: LLM Truth Maintenance
  • Definition 3.1: Predictive Power