I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha

TL;DR

This work systematically tests the assumption that safety classifiers trained on frozen embeddings transfer across model updates and finds that it fails, exposing a fundamental fragility in production AI safety architectures.

Abstract

Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence only drops $14\%$, producing dangerous silent failures where $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
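
To make the link between $\sigma=0.02$ and $\approx 1^\circ$ concrete, the following is a minimal sketch, not the authors' code. It assumes that "normalized perturbation of magnitude $\sigma$" means a random perturbation vector rescaled to Euclidean norm $\sigma$, added to a unit-norm embedding, followed by re-normalization. The function names, the dimension `dim = 4096`, and the random embedding are hypothetical choices for illustration only.

```python
import numpy as np

def perturb_on_sphere(x, sigma, rng):
    """Add a random perturbation of Euclidean norm `sigma` to a unit vector,
    then project the result back onto the unit sphere.
    (Assumed reading of 'normalized perturbation'; not taken from the paper.)"""
    noise = rng.standard_normal(x.shape)
    noise *= sigma / np.linalg.norm(noise)   # rescale so that ||noise|| == sigma
    y = x + noise
    return y / np.linalg.norm(y)

def angle_degrees(a, b):
    """Angle between two unit vectors, in degrees."""
    cos = np.clip(np.dot(a, b), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

rng = np.random.default_rng(0)
dim = 4096                                   # hypothetical embedding dimension
x = rng.standard_normal(dim)
x /= np.linalg.norm(x)                       # stand-in for a unit-norm embedding

angles = [angle_degrees(x, perturb_on_sphere(x, 0.02, rng)) for _ in range(200)]
mean_angle = float(np.mean(angles))
print(f"mean angular drift at sigma=0.02: {mean_angle:.3f} degrees")
```

Under this assumption a random direction in high dimension is nearly orthogonal to the embedding, so the drift angle is close to $\arctan(0.02)$, roughly 1.15 degrees, which is consistent with the "$\approx 1^\circ$" figure quoted in the abstract.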

Paper Structure

This paper contains 38 sections, 29 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Instruction-tuned models exhibit worse safety classifier robustness. Base (blue) and instruct (red) variants both collapse to random performance after minimal drift, but instruct shows higher silent failure rates (top-right) and reduced class separability (bottom-right). Confidence on wrong predictions approaches 1.0 for both (bottom-left), indicating severe miscalibration.
  • Figure 2: Classifier brittleness exhibits a sharp threshold, mechanism invariance, and irreversibility. ROC-AUC collapses from 0.90 to 0.51 uniformly across drift types (top), with a failure cliff at $\sigma=0.01$--$0.028$ (bottom-left); cumulative brittleness and F1-score degradation confirm systematic, irreversible failure (bottom-middle/right).
  • Figure 3: Confidence becomes meaningless under drift. Calibration curves at checkpoints 0, 2, 4, 7 ($\sigma=0.000$ to $0.150$) show a widening gap between expected (blue) and actual (red) accuracy: calibration error grows from 6.1% to 26.9%, and high-confidence predictions achieve only 36% accuracy at maximum drift.
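
The calibration and silent-failure quantities mentioned in the captions can be illustrated with a short sketch. It is not the authors' implementation: it assumes the standard expected calibration error with equal-width confidence bins, and it treats a "silent failure" as a misclassification whose confidence is at or above a threshold. The function names, the bin count, the threshold of 0.8, and the toy data are all hypothetical.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins.
    (Standard ECE; the paper's exact binning is not stated in this excerpt.)"""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1
    bin_ids = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap         # weight by the bin's share of samples
    return float(ece)

def high_confidence_error_share(confidence, correct, threshold=0.8):
    """Fraction of misclassifications made with confidence >= threshold.
    The threshold is an assumed value, not one taken from the paper."""
    confidence = np.asarray(confidence, dtype=float)
    wrong = ~np.asarray(correct, dtype=bool)
    if not wrong.any():
        return 0.0
    return float((confidence[wrong] >= threshold).mean())

# Toy data (hypothetical): mostly confident predictions, many of them wrong
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hit = [1, 0, 0, 0, 1, 0]
ece = expected_calibration_error(conf, hit)
share = high_confidence_error_share(conf, hit)
print(f"ECE = {ece:.3f}, high-confidence share of errors = {share:.2f}")
```

A monitor that only alerts on low confidence would miss every error counted by `high_confidence_error_share`, which is why a large value of that share undermines confidence-based monitoring.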