Predictable Compression Failures: Order Sensitivity and Information Budgeting for Evidence-Grounded Binary Adjudication
Leon Chlon, Ahmed Karim, Maggie Chlon, MarcAntonio Awada
TL;DR
Transformer-based evidence-grounded QA is sensitive to the order in which exchangeable evidence is presented, which leads to unreliable answers or hallucinations. The paper develops a theory linking permutation-induced dispersion to information budgets via the Quantified Martingale Violation (QMV) bound and the Expectation-level Decompression Law (EDFL). From these it derives three deployment-facing constructs, Bits-to-Trust (B2T), Risk-of-Hallucination (RoH), and the Information Sufficiency Ratio (ISR), which yield a fixed ISR = 1 gating rule under permutation mixtures. Empirically, it validates logarithmic growth of dispersion with context depth, gains from permutation mixtures, and a causal dose-response in which added information reduces hallucinations; a held-out audit shows strong boundary alignment and low hallucination under ISR gating. The result is a practical abstention-based interface for reliable, auditable evidence-grounded QA, with operational guidance for deploying predicate-verifiable systems and managing the computational overhead of permutation mixing.
Abstract
Transformers used for evidence-grounded question answering with binary adjudication (e.g., support/refute or yes/no) can be highly sensitive to the order in which exchangeable evidence is presented, producing dispersion across permutations and unreliable attempted answers (``hallucinations'' under a Bernoulli predicate). We treat evidence order as a nuisance variable and show that next-token training minimizes expected conditional description length over orderings. This objective can be close to Bayes-optimal in expectation while deviating under any fixed ordering. We quantify this expectation--realization gap via a Quantified Martingale Violation (QMV) bound that predicts $\mathcal{O}(\log n)$ growth in permutation dispersion under harmonic positional sensitivity. We then derive the Expectation-level Decompression Law (EDFL), relating expected information budget to achievable reliability for Bernoulli predicates, and use it to define \emph{Bits-to-Trust} (B2T), \emph{Risk-of-Hallucination} (RoH), and the \emph{Information Sufficiency Ratio} (ISR), together with a fixed ISR-gating rule for answer/abstain decisions under permutation mixtures. On 3,059 grounded items from a five-benchmark evidence-grounded QA suite (FEVER, HotpotQA, NQ-Open, PopQA, and Controls), we observe logarithmic dispersion and Jensen gains from uniform permutation mixtures. In a pre-specified held-out audit (528 items), an ISR $= 1$ gate attains 0.0--0.7\% hallucination with 20.6--27.9\% abstention (95\% confidence intervals).
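As a rough illustration of the permutation dispersion and Jensen gain mentioned above, the following Python sketch enumerates all orderings of a small evidence set, scores each with a scoring function, and compares the log of the uniform-mixture probability with the mean of the per-ordering log-probabilities. The names `permutation_stats` and `toy_prob_correct`, the harmonic toy scoring rule, and the choice of standard deviation as the dispersion measure are assumptions made for this example; they are not the authors' implementation, and a real system would replace `toy_prob_correct` with a model call.

```python
import itertools
import math
import statistics


def permutation_stats(evidence, prob_correct):
    """Score every ordering of `evidence` and summarize the spread.

    `prob_correct` is a hypothetical callable mapping an ordered list of
    evidence items to the probability assigned to the correct label.
    """
    probs = [prob_correct(list(order))
             for order in itertools.permutations(evidence)]
    mixture = sum(probs) / len(probs)            # uniform permutation mixture
    mean_log = sum(math.log(p) for p in probs) / len(probs)
    return {
        "dispersion": statistics.pstdev(probs),  # spread across orderings
        "mixture_prob": mixture,
        # Jensen's inequality: log(mean) >= mean(log), so this gap is >= 0.
        "jensen_gain": math.log(mixture) - mean_log,
    }


def toy_prob_correct(order):
    """Toy stand-in for a model (assumption): the decisive passage "key"
    contributes with a harmonic weight 1 / (position + 1)."""
    position = order.index("key")
    return 0.5 + 0.4 / (position + 1)


stats = permutation_stats(["key", "a", "b", "c"], toy_prob_correct)
print(stats)
```

Exhaustive enumeration is only feasible for a handful of evidence items; with larger sets, sampling a fixed number of random permutations would be the natural substitute, at the cost of a noisier mixture estimate.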
