Valid Stopping for LLM Generation via Empirical Dynamic Formal Lift

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Abstract

We introduce Sequential-EDFL (Empirical Dynamic Formal Lift), applying anytime-valid sequential testing to language model generation stopping. Our approach tracks information lift -- the log-likelihood ratio between full models and deliberately weakened "skeleton" baselines -- using self-normalized empirical-Bernstein e-processes that provide formal delta-level error control regardless of stopping time. We handle unknown centering through online mean estimation, combine multiple parameters via mixture e-processes, and support adaptive resets under distributional drift. On six benchmarks, Sequential-EDFL reduces generation by 22-28% vs. sequential baselines while maintaining delta-level control with 12% computational overhead. We introduce automated skeletons (distilled submodels, randomized logits) and show robustness across skeleton families. Composing EDFL with a lightweight correctness gate (sentence boundaries + verifier) improves end-task correctness while preserving anytime-valid guarantees by only delaying stopping. Our certificates control information sufficiency, not factual correctness -- 10.9% of stopped sequences remain incorrect even with the gate (13.2-22.7% without it). EDFL serves as a first-stage filter reducing verification burden by 83%, not as a standalone solution for safety-critical domains.

Paper Structure

This paper contains 31 sections, 3 theorems, 7 equations, 4 figures, 20 tables.

Key Result

Theorem 3.1

Let $X_1, X_2, \ldots$ be information lift observations with $X_t \in [0, c]$. Using mixture e-process $M_t = \sum_{k=1}^K w_k M_t(\lambda_k)$ with threshold $u = 1/\delta$ and adaptive resets with budgets $\delta_j = 6\delta/(\pi^2 j^2)$, the stopping rule $\tau = \inf\{t: M_t \geq u\}$ satisfies: for any $\epsilon > 0$, where $\mu_s = \mathbb{E}[X_s | \mathcal{F}_{s-1}]$. This guarantee holds r
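The stopping rule in Theorem 3.1 can be illustrated with a minimal sketch. It is not the paper's construction: instead of the self-normalized empirical-Bernstein e-process with online mean estimation, it substitutes a simple product ("betting") e-process $M_t(\lambda) = \prod_{s \le t} (1 + \lambda (X_s - \mu_0))$ against a fixed, hypothetical null mean `mu0`. The names `reset_budget`, `mixture_stop_time`, `mu0` and `reset_at` are illustrative assumptions, as is the choice to use threshold $1/\delta_j$ in epoch $j$ so that the per-epoch budgets sum to $\delta$. How drift is detected is not shown; the caller supplies the reset times.

```python
import math
from typing import Collection, Iterable, Optional, Sequence


def reset_budget(delta: float, j: int) -> float:
    """Error budget for epoch j = 1, 2, ...: 6*delta / (pi^2 * j^2)."""
    return 6.0 * delta / (math.pi ** 2 * j ** 2)


def mixture_stop_time(
    lifts: Iterable[float],
    mu0: float,
    lambdas: Sequence[float],
    weights: Sequence[float],
    delta: float,
    reset_at: Collection[int] = (),
) -> Optional[int]:
    """Return the first t (1-indexed) at which the mixture crosses its threshold.

    Assumptions (not from the paper): lifts are non-negative, mu0 is a fixed
    null mean, and each lambda satisfies 0 <= lambda and lambda * mu0 < 1 so
    that every factor 1 + lambda * (x - mu0) stays strictly positive.
    """
    if len(lambdas) != len(weights):
        raise ValueError("lambdas and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and sum to 1")
    for lam in lambdas:
        if lam < 0 or lam * mu0 >= 1:
            raise ValueError("each lambda must satisfy 0 <= lambda and lambda * mu0 < 1")

    epoch = 1
    threshold = 1.0 / reset_budget(delta, epoch)
    log_m = [0.0] * len(lambdas)  # log of each component process, starts at log(1)

    for t, x in enumerate(lifts, start=1):
        if x < 0:
            raise ValueError("lift observations must be non-negative")
        if t in reset_at:
            # Start a new epoch: fresh processes and a smaller error budget.
            epoch += 1
            threshold = 1.0 / reset_budget(delta, epoch)
            log_m = [0.0] * len(lambdas)
        for k, lam in enumerate(lambdas):
            log_m[k] += math.log1p(lam * (x - mu0))
        mixture = sum(w * math.exp(lm) for w, lm in zip(weights, log_m))
        if mixture >= threshold:
            return t
    return None  # threshold never crossed on this sequence
```

Under the null that the conditional mean of each observation is at most `mu0`, every component is a non-negative supermartingale starting at 1, and so is their convex mixture, which is what makes a threshold of the form $1/\delta_j$ meaningful in this simplified setting. Calling `mixture_stop_time(lifts, mu0=0.5, lambdas=[0.5, 1.0], weights=[0.5, 0.5], delta=0.1)` on a stream of clipped lifts returns the stopping index, or `None` if generation should continue.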

Figures (4)

  • Figure 1: Time-uniform empirical risk on GSM8K. Sequential-EDFL tracks target $\delta=0.1$ most closely.
  • Figure 2: Information lift: Full model $P$ (peaked) vs. skeleton $S$ (flat). Large lift indicates evidence accumulation.
  • Figure 3: Sequential-EDFL algorithm overview. Per-token: compute lift, update e-process, stop at boundary $u_J$, reset on drift.
  • Figure 4: Information sufficiency vs. factual correctness. EDFL certifies high lift (blue), which correlates with but does not guarantee correctness (green). The 16% gap (average across datasets) represents confident incorrect answers—our method's fundamental limitation for safety-critical deployment.
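The information lift pictured in Figure 2 can be sketched for a single token as the log-likelihood ratio between the full model $P$ and the skeleton $S$. The sketch below assumes both models expose raw logits over the same vocabulary, and that the lift is clipped to $[0, c]$ to match the range $X_t \in [0, c]$ used in Theorem 3.1. The function names and the exact clipping rule are illustrative assumptions, not the paper's definitions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def clipped_lift(
    full_logits: Sequence[float],
    skeleton_logits: Sequence[float],
    token: int,
    c: float,
) -> float:
    """Lift of one generated token: log P(token) - log S(token), clipped to [0, c].

    `token` is the index of the generated token in the shared vocabulary.
    The clipping to [0, c] is an assumption made to match X_t in [0, c].
    """
    if len(full_logits) != len(skeleton_logits):
        raise ValueError("both models must score the same vocabulary")
    raw = log_softmax(full_logits)[token] - log_softmax(skeleton_logits)[token]
    return min(max(raw, 0.0), c)
```

A peaked full-model distribution over a flat skeleton gives a positive lift for the favored token, while tokens the full model considers less likely than the skeleton does are clipped to zero.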

Theorems & Definitions (6)

  • Definition 2.1: Information Lift
  • Theorem 3.1: Anytime-Valid Information Sufficiency Certification
  • Lemma 3.2: Monotone delay preserves validity
  • Theorem 3.3: Validity with adaptive resets
  • Definition A.1: Clipped lift
  • proof