Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks
Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary
TL;DR
This work tackles the challenge of using probabilistic LLMs for formal reasoning while preserving deterministic guarantees by introducing a probabilistic context-free grammar (PCFG) framework to model distributions over SMT-LIB outputs. It formalizes the probability space $igmu_{T, heta,SMT}$ and shows how to approximate it with a PCFG $G_{SMT}$, estimating rule probabilities via MLE with smoothing and deriving a suite of uncertainty metrics including NSUI, entropy-based measures, and self-consistency signals. Through extensive evaluation on four reasoning benchmarks with five frontier LLMs, the study reveals task-dependent performance of SMT autoformalization (e.g., +34.8% on ProofWriter, -44.5% on FOLIO) and demonstrates that PCFG-derived metrics, especially when fused, enable selective verification that significantly reduces error rates (14–100%) with minimal abstention. The results highlight the potential of geometry-aware, structure-sensitive UQ to bridge neural generation and formal verification, offering practical pathways to more reliable LLM-driven formalization and guiding future work on grounding and alignment between formal and natural-language reasoning pathways.
Abstract
Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization's domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.
