Table of Contents
Fetching ...

BEAVER: An Efficient Deterministic LLM Verifier

Tarun Suresh, Nalin Wadhwa, Debangshu Banerjee, Gagandeep Singh

TL;DR

BEAVER tackles the lack of deterministic guarantees for LLM constraint satisfaction by introducing a frontier-driven, prefix-closed verification framework. It leverages a token trie and a frontier to incrementally compute sound lower and upper bounds on constraint-satisfaction probability, with a branch-and-bound style algorithm that tightens bounds anytime during execution. The approach demonstrates substantial tightening over rejection sampling across GSM-Symbolic, Enron leakage, and CyberSecEval tasks, and provides actionable risk assessment by exposing high-risk instances with tight bounds. The work also analyzes runtime and decoding-parameter effects, discusses limitations, and releases open-source tooling for practical deployment in safety-critical settings.

Abstract

As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification and secure code generation tasks across multiple state of the art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high risk instances compared to baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.

BEAVER: An Efficient Deterministic LLM Verifier

TL;DR

BEAVER tackles the lack of deterministic guarantees for LLM constraint satisfaction by introducing a frontier-driven, prefix-closed verification framework. It leverages a token trie and a frontier to incrementally compute sound lower and upper bounds on constraint-satisfaction probability, with a branch-and-bound style algorithm that tightens bounds anytime during execution. The approach demonstrates substantial tightening over rejection sampling across GSM-Symbolic, Enron leakage, and CyberSecEval tasks, and provides actionable risk assessment by exposing high-risk instances with tight bounds. The work also analyzes runtime and decoding-parameter effects, discusses limitations, and releases open-source tooling for practical deployment in safety-critical settings.

Abstract

As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification and secure code generation tasks across multiple state of the art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high risk instances compared to baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.

Paper Structure

This paper contains 55 sections, 4 theorems, 17 equations, 4 figures, 6 tables, 3 algorithms.

Key Result

lemma 1

If $P = \sum_{\pmb{s}_i \in \mathcal{C} }\mu(\pmb{s}_i) * \mathbbm{1}[\pmb{s}_i \models \Phi]$ then $0 \leq P \leq 1$.

Figures (4)

  • Figure 1: BEAVER workflow for computing sound probability bounds. Given a prompt, language model, and a prefix-closed semantic constraint, BEAVER iteratively: (1) selects an incomplete leaf from the frontier, (2) expands it by querying the model and adding valid continuations to the token trie, and (3) updates the sound probability bounds $[P_{LB}, P_{UB}]$ based on the new frontier state.
  • Figure 2: Evolution of the token trie $\mathcal{T}$ through three iterations of BEAVER on the bash command safety constraint. Starting from the empty trie, BEAVER expands nodes $n_0, n_1,$ and $n_4$ in sequence. Green nodes indicate incomplete sequences eligible for expansion, turquoise nodes indicate complete sequences (ending in $\langle \texttt{eos} \rangle$). Probability bounds tighten from $[0.01, 0.9]$ after iteration 1, to $[0.213, 0.815]$ after iteration 3, to $[0.7, 0.8]$ after iteration 10. Low probability sequence nodes omitted for brevity.
  • Figure 3: Comparison of Avg probability bounds by BEAVER and Rejection Sampling over Forward Passes and Time for Qwen2.5-14B Instruct on GSM-Symbolic Dataset
  • Figure : SearchSequence -- Max-$\mu$ strategy

Theorems & Definitions (10)

  • definition 1: Semantic Constraint
  • definition 2: Prefix-closed semantic constraints
  • definition 3: Complete Token Sequences
  • definition 4: verification problem
  • definition 5: Token Trie
  • definition 6: Frontier
  • lemma 1
  • lemma 2
  • theorem 1: Soundness of the bounds
  • theorem 2: Worst-Case Complexity of Algorithm \ref{['alg:1']}