Table of Contents
Fetching ...

Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary

TL;DR

This work tackles the challenge of using probabilistic LLMs for formal reasoning while preserving deterministic guarantees by introducing a probabilistic context-free grammar (PCFG) framework to model distributions over SMT-LIB outputs. It formalizes the probability space $igmu_{T, heta,SMT}$ and shows how to approximate it with a PCFG $G_{SMT}$, estimating rule probabilities via MLE with smoothing and deriving a suite of uncertainty metrics including NSUI, entropy-based measures, and self-consistency signals. Through extensive evaluation on four reasoning benchmarks with five frontier LLMs, the study reveals task-dependent performance of SMT autoformalization (e.g., +34.8% on ProofWriter, -44.5% on FOLIO) and demonstrates that PCFG-derived metrics, especially when fused, enable selective verification that significantly reduces error rates (14–100%) with minimal abstention. The results highlight the potential of geometry-aware, structure-sensitive UQ to bridge neural generation and formal verification, offering practical pathways to more reliable LLM-driven formalization and guiding future work on grounding and alignment between formal and natural-language reasoning pathways.

Abstract

Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.

Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

TL;DR

This work tackles the challenge of using probabilistic LLMs for formal reasoning while preserving deterministic guarantees by introducing a probabilistic context-free grammar (PCFG) framework to model distributions over SMT-LIB outputs. It formalizes the probability space and shows how to approximate it with a PCFG , estimating rule probabilities via MLE with smoothing and deriving a suite of uncertainty metrics including NSUI, entropy-based measures, and self-consistency signals. Through extensive evaluation on four reasoning benchmarks with five frontier LLMs, the study reveals task-dependent performance of SMT autoformalization (e.g., +34.8% on ProofWriter, -44.5% on FOLIO) and demonstrates that PCFG-derived metrics, especially when fused, enable selective verification that significantly reduces error rates (14–100%) with minimal abstention. The results highlight the potential of geometry-aware, structure-sensitive UQ to bridge neural generation and formal verification, offering practical pathways to more reliable LLM-driven formalization and guiding future work on grounding and alignment between formal and natural-language reasoning pathways.

Abstract

Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.

Paper Structure

This paper contains 23 sections, 1 theorem, 9 equations, 15 figures, 10 tables.

Key Result

Theorem 1

Let $\mu$ be a distribution on a discrete sample space $\Sigma^*$, with Shannon entropy $H(\mu) \;=\; -\sum_{x\in\Sigma^*} \mu(x)\,\log_2 \mu(x).$ Suppose we draw $N$ i.i.d. samples from $\mu$. Then, for any measurable subset $A\subseteq \Sigma^*$ with $\mu(A)=\epsilon$, the probability that none of

Figures (15)

  • Figure 1: Analysis of empirically observed LLM-generated SMT-LIB formalizations for the statement 'Everyone who studies math or physics and works hard will succeed.' The figure presents actual outputs and measurements: LLM-generated logical variations (left), measured PCFG rule frequencies (center), and computed probability distributions (right). The calculated uncertainty metrics are derived from the observed PCFG patterns and successfully predict formalization uncertainty. No synthetic or simulated data is used.
  • Figure 2: Spectral Radius VS Temperature
  • Figure 3: Grammar Entropy VS Temperature
  • Figure 4: KLD vs Temp
  • Figure 5: Entropy Ratio vs Temp
  • ...and 10 more figures

Theorems & Definitions (4)

  • Theorem 1: Coverage Guarantee
  • proof
  • Definition 1: Gaussian Temperature Schedule
  • Definition 2: Exponential Temperature Schedule