Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation
Matthias Hein, Maksym Andriushchenko
TL;DR
The paper tackles the vulnerability of classifiers to adversarial perturbations by introducing instance-specific formal robustness guarantees. It derives a lower bound on the norm of the perturbation needed to change the predicted class, expressed through a local cross-Lipschitz constant, and proposes Cross-Lipschitz regularization to tighten these guarantees. The authors provide concrete bounds for kernel methods (Gaussian kernels) and one-hidden-layer neural networks, along with an efficient method for generating box-constrained adversarial samples. Empirically, Cross-Lipschitz regularization improves the robustness bounds and often maintains or enhances accuracy on the MNIST, CIFAR-10, and German Traffic Sign benchmarks, with local constants giving substantially tighter guarantees than global ones. The work lays groundwork for more reliable deployment of ML systems in safety-critical contexts.
Abstract
Recent work has shown that state-of-the-art classifiers are quite brittle: a small adversarial change to an input that was originally classified correctly with high confidence leads to a wrong classification, again with high confidence. This raises concerns that such classifiers are vulnerable to attacks and calls into question their usage in safety-critical systems. In this paper we show, for the first time, formal guarantees on the robustness of a classifier by giving instance-specific lower bounds on the norm of the input manipulation required to change the classifier decision. Based on this analysis, we propose the Cross-Lipschitz regularization functional. We show that using this form of regularization in kernel methods and neural networks improves the robustness of the classifier without any loss in prediction performance.
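To make the idea of an instance-specific lower bound concrete, the following is a minimal sketch for the simplest case, a linear classifier f(x) = Wx + b. This special case is chosen here for illustration and is not the kernel or neural network setting treated in the paper. For a linear classifier the gradient difference between two class scores is the constant vector w_c - w_j, so the decision at x cannot change under any perturbation whose p-norm is below the smallest ratio of score margin to the dual-norm of w_c - w_j. The function names and the toy weights are hypothetical.

```python
import math

def dual_exponent(p):
    """Return q with 1/p + 1/q = 1 (p = 1 pairs with infinity)."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)

def vector_norm(v, q):
    """q-norm of a list of floats, including the q = infinity case."""
    if q == math.inf:
        return max(abs(a) for a in v)
    return sum(abs(a) ** q for a in v) ** (1.0 / q)

def linear_robustness_bound(W, b, x, p=2.0):
    """Predicted class c and a p-norm radius within which c cannot change.

    Assumes a linear classifier f(x) = W x + b (an illustrative special
    case). The radius is min over j != c of
    (f_c(x) - f_j(x)) / ||w_c - w_j||_q, with q the dual exponent of p.
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) + bj
              for row, bj in zip(W, b)]
    c = max(range(len(scores)), key=scores.__getitem__)
    q = dual_exponent(p)
    bound = math.inf
    for j, row in enumerate(W):
        if j == c:
            continue
        diff = [wc - wj for wc, wj in zip(W[c], row)]
        cross_lipschitz = vector_norm(diff, q)
        if cross_lipschitz > 0:  # identical rows can never overtake class c
            bound = min(bound, (scores[c] - scores[j]) / cross_lipschitz)
    return c, bound

# Toy three-class example with made-up weights.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 0.0]
pred, radius_l2 = linear_robustness_bound(W, b, x, p=2.0)
_, radius_linf = linear_robustness_bound(W, b, x, p=math.inf)
```

In the linear case this radius is the exact distance to the nearest decision boundary. For nonlinear classifiers the gradient difference varies with the input, which is why the guarantees described above rely on a local cross-Lipschitz constant around each instance rather than a single global one.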
