Information-Theoretic Limits of Safety Verification for Self-Improving Systems

Arsenios Scrivens

Abstract

Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions -- requiring sum delta_n < infinity (bounded risk) and sum TPR_n = infinity (unbounded utility) -- and establish a theory of their (in)compatibility. Classification impossibility (Theorem 1): For power-law risk schedules delta_n = O(n^{-p}) with p > 1, any classifier-based gate under overlapping safe/unsafe distributions satisfies TPR_n <= C_alpha * delta_n^beta via Hölder's inequality, forcing sum TPR_n < infinity. This impossibility is exponent-optimal (Theorem 3). A second independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without Hölder's inequality. Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum achievable classifier utility is U*(N, B) = N * TPR_NP(B/N), growing as exp(O(sqrt(log N))) -- subpolynomial. At N = 10^6 with budget B = 1.0, a classifier extracts at most U* ~ 87 versus a verifier's ~500,000. Verification escape (Theorem 2): A Lipschitz ball verifier achieves delta = 0 with TPR > 0, escaping the impossibility. Formal Lipschitz bounds for pre-LayerNorm transformers under LoRA enable LLM-scale verification. The separation is strict. We validate on GPT-2 (d_LoRA = 147,456): conditional delta = 0 with TPR = 0.352. Comprehensive empirical validation is in the companion paper [D2].

Paper Structure

This paper contains 61 sections, 11 theorems, 37 equations, 6 figures.

Key Result

Theorem 1

Let $P^+, P^-$ be distributions on $\mathbb{R}^k$ with $P^+ \ll P^-$ (absolute continuity). Suppose $D_{\alpha_0}(P^+ \| P^-) < \infty$ for some $\alpha_0 > p/(p-1)$. Then for any sequence of binary classifiers with false acceptance rates $\delta_n \leq c/n^p$ for some $c > 0$, $p > 1$:
$$\mathrm{TPR}_n \leq C_{\alpha_0}\,\delta_n^{\beta}, \quad \beta = 1 - 1/\alpha_0, \qquad \text{and hence} \quad \sum_{n=1}^{\infty} \mathrm{TPR}_n < \infty.$$
That is, bounded cumulative risk under any power-law schedule forces bounded cumulative utility.
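The dichotomy in Theorem 1 can be illustrated numerically. The sketch below assumes a hypothetical Gaussian mean-shift pair (unsafe $\sim N(0,1)$, safe $\sim N(\mu,1)$, with an assumed separation $\mu = 2$), for which the optimal Neyman--Pearson test at false-acceptance rate $\delta$ has $\mathrm{TPR}(\delta) = \Phi(\Phi^{-1}(\delta) + \mu)$. This is an illustration of the summability phenomenon, not the paper's construction; the schedule constants are arbitrary.

```python
# Numeric sketch of Theorem 1 under an ASSUMED Gaussian mean-shift pair:
# unsafe ~ N(0,1), safe ~ N(mu,1). The Neyman-Pearson test at
# false-acceptance rate delta accepts x > Phi^{-1}(1 - delta), giving
# TPR(delta) = Phi(Phi^{-1}(delta) + mu).
from statistics import NormalDist

std = NormalDist()      # standard normal: cdf / inv_cdf
mu = 2.0                # hypothetical safe/unsafe mean separation
c, p = 0.1, 2.0         # power-law risk schedule delta_n = c / n^p

def tpr(delta: float) -> float:
    """Optimal (NP) true-positive rate at false-acceptance rate delta."""
    return std.cdf(std.inv_cdf(delta) + mu)

risk_sum, utility_sum = 0.0, 0.0
for n in range(1, 100_001):
    delta_n = c / n**p
    risk_sum += delta_n
    utility_sum += tpr(delta_n)
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n={n:>6}  cum risk={risk_sum:.4f}  cum utility={utility_sum:.4f}")
# Cumulative risk stays bounded (-> c * pi^2 / 6 exactly), and cumulative
# utility also flattens: sum TPR_n converges, as Theorem 1 predicts.
```

The successive increments of the utility partial sums shrink rapidly: per-step TPR decays faster than any constant, so the total utility extractable under the bounded-risk schedule is finite.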

Figures (6)

  • Figure 1: Overview of the two gate architectures: classification gates (left) threshold a feature-space representation, incurring $\delta > 0$; verification gates (right) certify safety via a Lipschitz ball, achieving $\delta = 0$. The classification impossibility (Theorem 1) and verification escape (Theorem 2) establish a structural dichotomy.
  • Figure 2: Scaling analysis of the Lipschitz ball verifier from $d = 84$ to $d = 17{,}408$. Ball soundness is 100% at all dimensions. Required mutation scale $\sigma^*$ decreases as $O(d^{-0.54})$.
  • Figure 3: Exponent-optimality validation (Theorem 3). The NP classifier achieves 10--90% of the Hölder ceiling at deployment-relevant $\delta$, confirming near-tightness.
  • Figure 4: Finite-horizon utility ceiling (Theorem 5). The exact ceiling $U^*(N,B)$ grows as $\exp(O(\sqrt{\log N}))$ (subpolynomial), vastly below the MI bound ($\sqrt{N}$) and Hölder--Jensen ($N^{1-\beta}$). The ball verifier's utility grows linearly ($\Theta(N)$).
  • Figure 5: GPT-2 LoRA validation ($d_{\text{LoRA}} = 147{,}456$). Inside-ball: 50/50 safe ($\delta = 0$). Effective TPR $= 0.352$.
  • ...and 1 more figure
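The classifier-versus-verifier gap summarized in Figure 4 can be sketched numerically from the ceiling formula $U^*(N,B) = N \cdot \mathrm{TPR}_{\mathrm{NP}}(B/N)$ of Theorem 5. The sketch below instantiates $\mathrm{TPR}_{\mathrm{NP}}$ for an assumed Gaussian mean-shift pair with hypothetical separation $\mu = 2$; the paper's exact constants (e.g. $U^* \approx 87$ at $N = 10^6$) depend on its specific distributions and will not be reproduced here. The verifier line uses the paper's reported $\mathrm{TPR} = 0.352$ from the GPT-2 validation, scaled linearly.

```python
# Qualitative illustration of the finite-horizon ceiling of Theorem 5,
# U*(N, B) = N * TPR_NP(B / N), under an ASSUMED Gaussian mean-shift
# pair (mu = 2): TPR_NP(delta) = Phi(Phi^{-1}(delta) + mu). Constants
# are illustrative only; the paper's distributions differ.
from statistics import NormalDist

std = NormalDist()
mu, B = 2.0, 1.0                 # assumed separation; total risk budget

def ceiling(N: int) -> float:
    """Classifier utility ceiling: risk budget B spread evenly over N steps."""
    return N * std.cdf(std.inv_cdf(B / N) + mu)

for N in (10**3, 10**4, 10**5, 10**6):
    verifier = 0.352 * N         # delta = 0 ball verifier, linear in N
    print(f"N={N:>7}  classifier ceiling={ceiling(N):9.1f}  verifier={verifier:9.0f}")
# The ceiling grows subpolynomially while the verifier grows linearly,
# so the separation widens without bound as N increases.
```

The design point the figure makes is visible even in this toy instance: multiplying $N$ by 1000 multiplies the classifier ceiling by far less than 1000, while the verifier's utility scales exactly linearly.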

Theorems & Definitions (26)

  • Definition: Soundness
  • Definition
  • Theorem 1: Safety--Utility Impossibility
  • Proof
  • Remark: On the per-step bound
  • Remark: Necessity of Distribution Overlap
  • Theorem 3: Exponent-Optimality of Hölder Bound
  • Proof sketch
  • Corollary 1: Minimax Optimality
  • Theorem 4: NP Counting Impossibility
  • ...and 16 more