Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates

Arsenios Scrivens

Abstract

Can classifier-based safety gates maintain reliable oversight as AI systems improve over hundreds of iterations? We provide comprehensive empirical evidence that they cannot. On a self-improving neural controller (d=240), eighteen classifier configurations -- spanning MLPs, SVMs, random forests, k-NN, Bayesian classifiers, and deep networks -- all fail the dual conditions for safe self-improvement. Three safe RL baselines (CPO, Lyapunov, safety shielding) also fail. Results extend to MuJoCo benchmarks (Reacher-v4 d=496, Swimmer-v4 d=1408, HalfCheetah-v4 d=1824). At controlled distribution separations up to delta_s=2.0, all classifiers still fail -- including the NP-optimal test and MLPs with 100% training accuracy -- demonstrating structural impossibility. We then show the impossibility is specific to classification, not to safe self-improvement itself. A Lipschitz ball verifier achieves zero false accepts across dimensions d in {84, 240, 768, 2688, 5760, 9984, 17408} using provable analytical bounds (unconditional delta=0). Ball chaining enables unbounded parameter-space traversal: on MuJoCo Reacher-v4, 10 chains yield +4.31 reward improvement with delta=0; on Qwen2.5-7B-Instruct during LoRA fine-tuning, 42 chain transitions traverse 234x the single-ball radius with zero safety violations across 200 steps. A 50-prompt oracle confirms oracle-agnosticity. Compositional per-group verification enables radii up to 37x larger than full-network balls. At d<=17408, delta=0 is unconditional; at LLM scale, conditional on estimated Lipschitz constants.

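The ball verifier and chaining scheme summarized in the abstract can be sketched numerically. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical safety metric $f(\theta) = 1 - L\|\theta\|$ with known Lipschitz constant $L$, so that any update inside radius $f(\theta)/L$ of a certified center provably keeps $f \geq 0$, i.e. each accepted step has $\delta = 0$ by construction. Re-centering the ball at each accepted point ("chaining") lets the traversal exceed any single ball's radius.

```python
import numpy as np

# Toy safety metric (hypothetical, for illustration only):
# f(theta) = 1 - L * ||theta||, which is L-Lipschitz in the Euclidean norm.
L = 0.5

def safety_margin(theta):
    return 1.0 - L * np.linalg.norm(theta)

def certified_radius(theta):
    # Inside this ball, f cannot cross zero: |f(theta') - f(theta)| <= L * r.
    return safety_margin(theta) / L

def chained_traversal(theta, target, max_steps=50):
    """Move toward target, re-certifying a fresh ball at each accepted center."""
    traversed = 0.0
    for _ in range(max_steps):
        r = certified_radius(theta)
        step = target - theta
        dist = np.linalg.norm(step)
        if dist <= 1e-12 or r <= 1e-12:
            break
        # Take the full step if it fits in the ball, else stop at the boundary.
        hop = step if dist <= r else step * (r / dist)
        theta = theta + hop  # provably safe: hop stays inside the current ball
        traversed += np.linalg.norm(hop)
    return theta, traversed

theta0 = np.array([-1.5, 0.0, 0.0, 0.0])   # initial certified radius is only 0.5
target = np.array([1.5, 0.0, 0.0, 0.0])
final, total = chained_traversal(theta0, target)
print(total, safety_margin(final))
```

In this toy run three chained balls cover a traversal several times the initial ball's radius while every intermediate center remains certified, mirroring (in miniature) the multi-chain traversals the abstract reports.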
Paper Structure

This paper contains 61 sections, 2 theorems, 6 equations, 9 figures, 1 algorithm.

Key Result

Theorem 1

Let $P^+, P^-$ be distributions on $\mathbb{R}^k$ with $P^+ \ll P^-$ and $D_{\alpha_0}(P^+ \| P^-) < \infty$ for some $\alpha_0 > p/(p-1)$. Then for any sequence of binary classifiers with false-accept rates $\delta_n \leq c/n^p$ for some $c > 0$, $p > 1$, the Hölder coupling forces $\sum_n \mathrm{TPR}_n < \infty$, so the dual conditions for safe self-improvement cannot be satisfied simultaneously.
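A quick numeric sanity check (not from the paper) of the theorem's hypothesis: the condition $\delta_n \leq c/n^p$ with $p > 1$ is exactly a convergent $p$-series, so it pins the gate into the $\sum_n \delta_n < \infty$ regime that the Hölder coupling then ties to $\sum_n \mathrm{TPR}_n < \infty$.

```python
# Partial sums of sum_n c / n**p for p > 1 converge (toward c * zeta(p)),
# so any classifier sequence meeting the hypothesis has summable
# false-accept rates. Values of c and p here are illustrative.
c, p = 1.0, 1.5
partials = [sum(c / n**p for n in range(1, N + 1)) for N in (10**2, 10**4, 10**6)]
print(partials)  # increasing, approaching zeta(1.5) ~ 2.612
```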

Figures (9)

  • Figure 1: Overview of the classification--verification dichotomy. Classification gates (left) threshold a feature-space representation, incurring $\delta > 0$; verification gates (right) certify safety via a Lipschitz ball, achieving $\delta = 0$.
  • Figure 2: Classifier failure across five baselines (\S\ref{sec:baselines}). At natural operating thresholds, all five have constant per-step $\delta > 0$, so $\sum\delta$ diverges (left). The Hölder coupling (Theorem \ref{thm:holder}) ensures that enforcing $\sum\delta < \infty$ would also force $\sum\mathrm{TPR} < \infty$, making the dual conditions unsatisfiable.
  • Figure 3: Scaling analysis of the Lipschitz ball verifier from $d = 84$ to $d = 17{,}408$. Ball soundness is 100% at all dimensions. Required mutation scale $\sigma^*$ decreases as $O(d^{-0.54})$.
  • Figure 4: Exponent-optimality validation. The NP classifier achieves 10--90% of the Hölder ceiling at deployment-relevant $\delta$.
  • Figure 5: Finite-horizon utility ceiling (D1 Theorem 5). The exact ceiling grows as $\exp(O(\sqrt{\log N}))$, vastly below the ball verifier's linear $\Theta(N)$ growth.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Theorem 1: Safety--Utility Impossibility; D1 Theorem 1
  • Theorem 2: Verification Escape; D1 Theorem 2