Predictable Compression Failures: Order Sensitivity and Information Budgeting for Evidence-Grounded Binary Adjudication

Leon Chlon, Ahmed Karim, Maggie Chlon, MarcAntonio Awada

TL;DR

The paper addresses order sensitivity in transformer-based evidence-grounded QA: when exchangeable evidence is presented in different orders, answers can become unreliable or hallucinated. It develops a theory linking permutation-induced dispersion to information budgets via the Quantified Martingale Violation (QMV) bound and the Expectation-level Decompression Law (EDFL), then derives three deployment-facing constructs, Bits-to-Trust (B2T), Risk-of-Hallucination (RoH), and the Information Sufficiency Ratio (ISR), which yield a fixed ISR $= 1$ gating rule under permutation mixtures. Empirically, it reports logarithmic growth of dispersion with context depth, Jensen gains from permutation mixtures, and a causal dose-response in which added information reduces hallucinations; a held-out audit with ISR gating shows strong boundary alignment and low hallucination. The result is a practical abstention-based interface for reliable, auditable evidence-grounded QA, with operational guidance on deploying predicate-verifiable systems and on managing computational overhead through permutation-mixing strategies.

Abstract

Transformers used for evidence-grounded question answering with binary adjudication (e.g., support/refute or yes/no) can be highly sensitive to the order in which exchangeable evidence is presented, producing dispersion across permutations and unreliable attempted answers (``hallucinations'' under a Bernoulli predicate). We treat evidence order as a nuisance variable and show that next-token training minimizes expected conditional description length over orderings. This objective can be close to Bayes-optimal in expectation while deviating under any fixed ordering. We quantify this expectation--realization gap via a Quantified Martingale Violation (QMV) bound that predicts $\mathcal{O}(\log n)$ growth in permutation dispersion under harmonic positional sensitivity. We then derive the Expectation-level Decompression Law (EDFL), relating expected information budget to achievable reliability for Bernoulli predicates, and use it to define \emph{Bits-to-Trust} (B2T), \emph{Risk-of-Hallucination} (RoH), and the \emph{Information Sufficiency Ratio} (ISR), together with a fixed ISR-gating rule for answer/abstain decisions under permutation mixtures. On 3,059 grounded items from a five-benchmark evidence-grounded QA suite (FEVER, HotpotQA, NQ-Open, PopQA, and Controls), we observe logarithmic dispersion and Jensen gains from uniform permutation mixtures. In a pre-specified held-out audit (528 items), an ISR $= 1$ gate attains 0.0--0.7\% hallucination with 20.6--27.9\% abstention (95\% confidence intervals).
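
The "Jensen gains from uniform permutation mixtures" mentioned above can be illustrated with a minimal sketch. It assumes that, for one item, the model assigns a probability to the correct binary label under each of several evidence orderings; the function name `jensen_gap` and the numbers are hypothetical and are not taken from the paper. Because $-\log$ is convex, the cross-entropy of the averaged prediction is never larger than the average cross-entropy across orderings, and the difference is the gap.

```python
import math

def jensen_gap(q_true):
    """Compare per-ordering cross-entropy with the uniform-mixture cross-entropy.

    q_true: probabilities assigned to the correct label, one per evidence ordering.
    Returns (average cross-entropy, mixture cross-entropy, gap), all in nats.
    """
    m = len(q_true)
    # Average of the per-ordering losses: mean of -log q_pi.
    avg_ce = sum(-math.log(q) for q in q_true) / m
    # Loss of the uniform mixture: -log of the mean probability.
    mix_ce = -math.log(sum(q_true) / m)
    # By Jensen's inequality the gap is non-negative.
    return avg_ce, mix_ce, avg_ce - mix_ce

# Hypothetical probabilities for one item under four orderings (illustrative only).
q_by_order = [0.9, 0.6, 0.3, 0.8]
avg_ce, mix_ce, gap = jensen_gap(q_by_order)
print(f"average CE = {avg_ce:.4f} nats, mixture CE = {mix_ce:.4f} nats, gap = {gap:.4f} nats")
```

The gap is zero only when every ordering yields the same probability, so a larger gap signals stronger order sensitivity on that item.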

Paper Structure

This paper contains 68 sections, 15 theorems, 34 equations, 3 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Under Assumption asmp:local, If, in addition, the decay is $(\alpha,C_{\mathrm{tot}})$-regular, then

Figures (3)

  • Figure 1: Order sensitivity and mixture gains (Experiment 1; $N=3{,}059$, $m=16$ banded permutations). Left: estimated log-scaling slope $b$ in the empirical law $|R_\pi|\approx a + b\log n$ (95% CI as defined in Section \ref{sec:exp_reporting}; Table \ref{tab:dispersion_summary}). Right: Jensen gap (nats/token) from uniform permutation mixtures, quantifying the cross-entropy improvement from averaging over evidence orderings.
  • Figure 2: Causal dose-response of information (Experiment 2). Randomized study holding prompt length fixed at $L{=}4$ chunks while varying support dose $d\in\{0,1,2,3\}$ (subset of Factuality Slice; Appendix H.4). We plot the net change from dose $0\to 3$ in hallucination rate against the induced decrease in the information-budget estimate ($\Delta\bar{\Delta} := \bar{\Delta}_{d=0}-\bar{\Delta}_{d=3} \approx 3\times 0.375 = 1.125$ nats). The fitted slope (OLS) is 0.127 fewer hallucinations per additional nat ($-12.7$ pp/nat); over the same shift, answer rate increases by 37.5pp and accuracy on attempts by 45.6pp.
  • Figure 3: Audit operating point (Experiment 3; $N=528$, $m=6$). Pre-specified held-out audit with ISR$=1$ and fixed seeds. Shaded region shows the 95% CI for coverage ($1-$abstention) and hallucination rate (Wilson intervals as in Section \ref{sec:exp_reporting}); midpoint shown for visualization.
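
The Figure 3 caption reports Wilson intervals for rates estimated on $N=528$ items. As a minimal sketch of how such an interval is computed, the snippet below implements the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` and the example count are illustrative assumptions, not values from the paper's audit.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    successes: number of events observed, n: number of trials,
    z: standard-normal quantile (1.959964 corresponds to a 95% interval).
    Returns (lower, upper) as proportions in [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical count for illustration: zero events observed in 528 trials.
lo, hi = wilson_interval(0, 528)
print(f"95% Wilson interval: [{100 * lo:.2f}%, {100 * hi:.2f}%]")
```

Unlike the normal-approximation interval, the Wilson interval stays informative when the observed rate is at or near zero, which is the regime a low-hallucination audit operates in.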

Theorems & Definitions (22)

  • Definition 1: Permutation-induced residual
  • Theorem 1: Quantified Martingale Violation
  • Proposition 1: First-order model $\Rightarrow$ local stability
  • Theorem 2: Explicit log-scaling under first-order positional sensitivity
  • Proposition 2: Assumption-free JS certificate
  • Theorem 3: Permutation-mixture realizability with averaging head
  • proof
  • Theorem 4: MDL Optimality in Expectation
  • proof
  • Theorem 5: Expectation-level Decompression Law (EDFL)
  • ...and 12 more
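
The log-scaling result (Theorem 2) and the empirical law $|R_\pi|\approx a + b\log n$ from Figure 1 can be connected with a small numerical sketch. It assumes, purely for illustration, that "harmonic positional sensitivity" means position $k$ contributes on the order of $c/k$ to the dispersion, so the total over $n$ positions is $c\,H_n \approx c(\log n + \gamma)$. The scale `c`, the depth grid, and the helper names are hypothetical and are not taken from the paper.

```python
import math

def harmonic_dispersion(n, c=1.0):
    """Toy dispersion: sum of per-position sensitivities c/k for k = 1..n."""
    return c * sum(1.0 / k for k in range(1, n + 1))

def fit_log_law(ns, ys):
    """Ordinary least squares fit of y = a + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical context depths (number of evidence chunks).
depths = [2, 4, 8, 16, 32, 64]
dispersion = [harmonic_dispersion(n, c=1.0) for n in depths]
a, b = fit_log_law(depths, dispersion)
print(f"fitted intercept a = {a:.3f}, fitted slope b = {b:.3f}")
```

Under this toy model the fitted slope comes out close to the assumed scale `c`, which is why a per-position sensitivity decaying like $1/k$ is consistent with dispersion growing logarithmically in context depth.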