$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks

Rushang Karia; Daniel Bramblett; Daksh Dobhal; Pulkit Verma; Siddharth Srivastava

$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks

Rushang Karia, Daniel Bramblett, Daksh Dobhal, Pulkit Verma, Siddharth Srivastava

TL;DR

This work tackles scalable, ground-truth-free evaluation of LLMs on translating between formal syntax ($FS$) and natural language ($NL$) in domains including propositional logic, first-order logic, and regular expressions. It introduces $\forall$uto$\exists$val, a CFG-based automatic dataset generator paired with formal verifiers to guarantee semantic equivalence between $\varphi$ and $\varphi'$ produced by a single LLM performing both directions of translation. Extensive experiments across multiple SOTA models reveal that current LLMs struggle as descriptional complexity grows and cannot reliably serve as verifiers, underscoring the need for robust, automatic benchmarking in formal translation tasks. The framework is open-source and scalable, enabling researchers to evaluate and advance LLM capabilities in formal reasoning while mitigating reliance on hand-crafted datasets.

Abstract

This paper presents $\forall$uto$\exists$val, a new approach for scaling LLM assessment in translating formal syntax -- such as first-order logic, regular expressions, etc -- to natural language (interpretation) or vice versa (compilation), thereby facilitating their use in applications such as generating/explaining logic and control flow for programs etc. Existing approaches for LLM assessment in these areas require labor-intensive ground-truth creation, the availability of which undermines the separation of training and test sets. Furthermore, such datasets typically include relatively few hand-coded test cases over which LLM accuracy is determined, thus making them inadequate for determining the safety or correctness of their generated outputs. We introduce a new approach that utilizes context-free grammars (CFGs) to generate out-of-distribution datasets on the fly and perform closed-loop testing of LLM capabilities using formal verifiers to guarantee the correctness of LLM outputs without any human intervention. We release our dataset and benchmark as open-source code at \url{https://github.com/AAIR-lab/auto-llm-assessment}. We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm. Our experiments reveal that SOTA LLMs are unable to solve the formal translation task adequately.

$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks

TL;DR

This work tackles scalable, ground-truth-free evaluation of LLMs on translating between formal syntax (

) and natural language (

) in domains including propositional logic, first-order logic, and regular expressions. It introduces

uto

val, a CFG-based automatic dataset generator paired with formal verifiers to guarantee semantic equivalence between

and

produced by a single LLM performing both directions of translation. Extensive experiments across multiple SOTA models reveal that current LLMs struggle as descriptional complexity grows and cannot reliably serve as verifiers, underscoring the need for robust, automatic benchmarking in formal translation tasks. The framework is open-source and scalable, enabling researchers to evaluate and advance LLM capabilities in formal reasoning while mitigating reliance on hand-crafted datasets.

Abstract

This paper presents

uto

val, a new approach for scaling LLM assessment in translating formal syntax -- such as first-order logic, regular expressions, etc -- to natural language (interpretation) or vice versa (compilation), thereby facilitating their use in applications such as generating/explaining logic and control flow for programs etc. Existing approaches for LLM assessment in these areas require labor-intensive ground-truth creation, the availability of which undermines the separation of training and test sets. Furthermore, such datasets typically include relatively few hand-coded test cases over which LLM accuracy is determined, thus making them inadequate for determining the safety or correctness of their generated outputs. We introduce a new approach that utilizes context-free grammars (CFGs) to generate out-of-distribution datasets on the fly and perform closed-loop testing of LLM capabilities using formal verifiers to guarantee the correctness of LLM outputs without any human intervention. We release our dataset and benchmark as open-source code at \url{https://github.com/AAIR-lab/auto-llm-assessment}. We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm. Our experiments reveal that SOTA LLMs are unable to solve the formal translation task adequately.

Paper Structure (26 sections, 5 equations, 13 figures, 5 tables, 1 algorithm)

This paper contains 26 sections, 5 equations, 13 figures, 5 tables, 1 algorithm.

Introduction
Formal Framework
Our Approach: $\forall$uto$\exists$val for Autonomous Assessment of ↔FS$\textit{NL}\leftrightarrow\textit{FS}$
Datasets and Assessment Framework
Dynamic Dataset Generator and Pre-generated Datasets
Assessment Metrics
Prompts
Assessment of SOTA LLMs
Related Work
Closing Remarks
Appendix Organization
Dataset Information
Dataset Storage and Maintenance Plans
Dataset Usage Guidelines
Dataset Generation
...and 11 more sections

Figures (13)

Figure 1: Our overall process for autonomous assessment of ↔FS$\textit{NL}\leftrightarrow\textit{FS}$.
Figure 2: Grammars (described in Sec. \ref{['subsec:dataset_desc']}) used for synthesizing the datasets in $\forall$uto$\exists$val.
Figure 3: Zero-shot Pass@1 results (avg. over 10 batches, higher values better) from using $\forall$uto$\exists$val to assess LLMs w.r.t. § A1, § A2, § A3 (Sec. \ref{['subsec:metrics']}) on the packaged datasets. The x-axis represents an increasing order of descriptional complexity. Our prompts, few-shot results, other statistical measures, LLM hyperparameters, compute resources, etc. are included in the supplementary material.
Figure 4: Zero-shot Pass@1 results (avg. over 10 batches, higher values better) for 3-SAT from using $\forall$uto$\exists$val to assess LLMs w.r.t. § A1, § A2, § A3 (Sec. \ref{['subsec:metrics']}) on the packaged datasets. The x-axis is the # of operators.
Figure 5: Average and standard deviation error of Zero-shot Pass@1 results from using $\forall$uto$\exists$val to assess LLMs w.r.t. § A1, § A2 (Sec. \ref{['subsec:metrics']}) on the first batch of the packaged datasets. The x-axis represents an increasing order of descriptional complexity.
...and 8 more figures

Theorems & Definitions (2)

Definition 2.1: Interpretation: →NL$\textit{FS}\rightarrow\textit{NL}$
Definition 2.2: Compilation: →FS$\textit{NL}\rightarrow\textit{FS}$

$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks

TL;DR

Abstract

$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks

Authors

TL;DR

Abstract

Table of Contents

Figures (13)

Theorems & Definitions (2)