I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Subramanyam Sahoo; Vinija Jain; Divya Chaudhary; Aman Chadha

TL;DR

This work tests the assumption that safety classifiers trained on frozen embeddings remain valid across model updates and finds that it fails, exposing a fundamental fragility in production AI safety architectures.

Abstract

Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, on the assumption that representations remain stable across model updates. We systematically investigate this assumption and find that it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence drops by only $14\%$, producing dangerous silent failures in which $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
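The link between $\sigma=0.02$ and roughly $1^\circ$ of drift can be checked numerically. The sketch below is illustrative only and is not the authors' procedure: it assumes that "normalized perturbation of magnitude $\sigma$" means adding a noise vector of L2 norm $\sigma$ to a unit-norm embedding and re-projecting onto the unit sphere. The function names and the 768-dimensional embedding are hypothetical. Under that reading, a random direction in high dimension is nearly orthogonal to the embedding, so the angle is close to $\arctan(\sigma)$.

```python
import numpy as np

def perturb_on_sphere(e, sigma, rng):
    """Add a random perturbation of L2 norm `sigma` to a unit vector,
    then re-project the result onto the unit sphere (assumed drift model)."""
    e = e / np.linalg.norm(e)
    noise = rng.standard_normal(e.shape)
    noise *= sigma / np.linalg.norm(noise)  # rescale noise to norm sigma
    drifted = e + noise
    return drifted / np.linalg.norm(drifted)

def angle_degrees(u, v):
    """Angle between two unit vectors, in degrees."""
    cos = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

rng = np.random.default_rng(0)
dim = 768  # hypothetical embedding width, chosen only for illustration
e = rng.standard_normal(dim)
e /= np.linalg.norm(e)

drifted = perturb_on_sphere(e, sigma=0.02, rng=rng)
drift_deg = angle_degrees(e, drifted)
print(f"angular drift: {drift_deg:.2f} degrees")
```

If the noise were instead drawn with per-coordinate standard deviation $\sigma$ and left unnormalized, the angle would grow with the embedding dimension, so the $\approx 1^\circ$ figure is consistent with the fixed-norm reading assumed here.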

Paper Structure (38 sections, 29 equations, 3 figures, 1 table)

This paper contains 38 sections, 29 equations, 3 figures, 1 table.

Introduction
Problem Formulation
Experimental Design & Results
Implications and Conclusions
Writing Assistance.
Limitations of LLM Use.
Potential Societal Impact.
Data Provenance and Consent.
Privacy and Deidentification.
Potential Harms.
Institutional Review.
Broader Recommendations.
Analysis
Mathematical Formulation of Experiments
Embedding Extraction
...and 23 more sections

Figures (3)

Figure 1: Instruction-tuned models exhibit worse safety classifier robustness. Base (blue) and instruct (red) variants both collapse to random performance after minimal drift, but the instruct variant shows higher silent failure rates (top-right) and reduced class separability (bottom-right). Confidence on wrong predictions approaches 1.0 for both (bottom-left), indicating severe miscalibration.
Figure 2: Classifier brittleness exhibits a sharp threshold, mechanism invariance, and irreversibility. ROC-AUC collapses from 0.90 to 0.51 uniformly across drift types (top), with a failure cliff at $\sigma=0.01$--$0.028$ (bottom-left); cumulative brittleness and F1-score degradation confirm systematic, irreversible failure (bottom-middle/right).
Figure 3: Confidence becomes meaningless under drift. Calibration curves at checkpoints 0, 2, 4, and 7 ($\sigma=0.000$ to $0.150$) show the gap between expected (blue) and actual (red) accuracy widening, with calibration error rising from 6.1% to 26.9% and high-confidence predictions achieving only 36% accuracy at maximum drift.
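
The two monitoring quantities the captions refer to, calibration error and the share of misclassifications made with high confidence, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: it assumes the calibration error is the standard binned expected calibration error (ECE), that "confidence" is the probability assigned to the predicted class, and that "high confidence" means at least 0.8. The function names, the bin count, the threshold, and the toy data are all hypothetical.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Binned ECE: weighted mean of |mean confidence - accuracy| per bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin ids 0..n_bins-1; bins are right-inclusive.
    bin_ids = np.digitize(confidence, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return float(ece)

def high_confidence_error_share(confidence, correct, threshold=0.8):
    """Fraction of misclassified samples whose confidence >= threshold."""
    confidence = np.asarray(confidence, dtype=float)
    wrong = ~np.asarray(correct, dtype=bool)
    if not wrong.any():
        return 0.0
    return float((confidence[wrong] >= threshold).mean())

# Toy data (made up): four predictions, two of them confidently wrong.
conf = [0.95, 0.92, 0.88, 0.65]
hit = [True, False, False, True]
ece = expected_calibration_error(conf, hit)
silent_share = high_confidence_error_share(conf, hit)
print(f"ECE = {ece:.3f}, high-confidence share of errors = {silent_share:.2f}")
```

A monitor that alerts only on falling mean confidence would miss the failure mode described above, which is why the error share is computed over misclassified samples rather than over all predictions.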
