Towards Certification of Uncertainty Calibration under Adversarial Attacks
Cornelius Emde, Francesco Pinto, Thomas Lukasiewicz, Philip H. S. Torr, Adel Bibi
TL;DR
This work tackles the problem of unreliable uncertainty estimates in neural classifiers under adversarial perturbations by introducing certified calibration. It develops a formal framework, built on Gaussian smoothing certificates, that bounds calibration under attack through two metrics: the certified Brier score (CBS), which admits a closed-form bound, and the approximate certified calibration error (ACCE), computed via a mixed-integer program. The authors also introduce Adversarial Calibration Training (ACT) to actively improve calibration. They propose two variants, Brier-ACT and ACCE-ACT, which optimize their losses under calibration adversaries using ADMM-based optimization. Empirical results on CIFAR-10, FashionMNIST, SVHN, CIFAR-100, and ImageNet show two things. First, ACT can reduce calibration error and Brier score at large certified radii. Second, ADMM-based ACCE approximations outperform baselines. Overall, the paper provides a practical pathway to certify and improve calibration in the presence of adversarial threats. This could matter for safety-critical deployments, where reliable confidence is as crucial as accuracy.
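To illustrate how Gaussian smoothing can yield a closed-form Brier bound, here is a minimal sketch. It is not the paper's CBS derivation. It relies on the standard fact that for any base function with outputs in [0, 1], the Gaussian-smoothed output g satisfies Phi(Phi^{-1}(g(x)) - eps/sigma) <= g(x') <= Phi(Phi^{-1}(g(x)) + eps/sigma) for all ||x' - x||_2 <= eps. It also assumes the smoothed class probabilities at x are known exactly; in practice they are Monte Carlo estimates with confidence intervals. The names `smoothed_interval` and `brier_upper_bound` are hypothetical.

```python
from statistics import NormalDist
from typing import Sequence, Tuple

_STD_NORMAL = NormalDist()  # standard Gaussian: Phi = cdf, Phi^{-1} = inv_cdf


def smoothed_interval(p: float, eps: float, sigma: float) -> Tuple[float, float]:
    """Range of a Gaussian-smoothed [0, 1]-valued output over an l2 ball.

    Hypothetical helper. Assumes p = E[h(x + delta)], delta ~ N(0, sigma^2 I),
    is known exactly at x. For every x' with ||x' - x||_2 <= eps:
        Phi(Phi^{-1}(p) - eps/sigma) <= g(x') <= Phi(Phi^{-1}(p) + eps/sigma)
    """
    if p <= 0.0 or p >= 1.0:
        # Degenerate case: h(x + delta) is almost surely 0 (or 1), so g is constant.
        q = min(max(p, 0.0), 1.0)
        return q, q
    z = _STD_NORMAL.inv_cdf(p)
    shift = eps / sigma
    return _STD_NORMAL.cdf(z - shift), _STD_NORMAL.cdf(z + shift)


def brier_upper_bound(probs: Sequence[float], label: int, eps: float, sigma: float) -> float:
    """Loose worst-case Brier score of one sample under l2 perturbations of size eps.

    Hypothetical helper, not the paper's CBS. Each class term is maximised
    independently over its interval and the sum-to-one constraint is ignored.
    The result is therefore a valid but generally non-tight upper bound.
    """
    total = 0.0
    for k, p in enumerate(probs):
        lo, hi = smoothed_interval(p, eps, sigma)
        if k == label:
            total += (1.0 - lo) ** 2  # (p_y - 1)^2 is largest at the smallest p_y
        else:
            total += hi ** 2  # p_k^2 is largest at the largest p_k
    return total


if __name__ == "__main__":
    clean = [0.7, 0.2, 0.1]  # hypothetical smoothed class probabilities at x
    for eps in (0.0, 0.25, 0.5):
        print(eps, brier_upper_bound(clean, label=0, eps=eps, sigma=0.5))
```

At eps = 0 the bound reduces to the clean Brier score of the smoothed probabilities. It grows monotonically with the radius. A tighter bound would also enforce that the worst-case probabilities lie on the simplex.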
Abstract
Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, certification methods have been developed to provide provable guarantees that their predictions are insensitive to such perturbations. Furthermore, in safety-critical applications, the frequentist interpretation of a classifier's confidence (known as model calibration) can be of utmost importance. This property can be measured via the Brier score or the expected calibration error. We show that attacks can significantly harm calibration. We therefore propose certified calibration: worst-case bounds on calibration under adversarial perturbations. Specifically, we derive analytic bounds on the Brier score and approximate bounds on the expected calibration error via the solution of a mixed-integer program. Finally, we propose novel calibration attacks and demonstrate how they can improve model calibration through adversarial calibration training.
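The abstract names two calibration measures, which can be computed as in the minimal sketch below. It assumes the multi-class Brier score, summed over classes and averaged over samples. It also assumes the common equal-width binned expected calibration error (ECE) on top-1 confidence. The paper may use different conventions, such as the bin count or normalisation. The function names are illustrative.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Multi-class Brier score: mean over samples of sum_k (p_k - 1[k == y])^2."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def expected_calibration_error(
    probs: Sequence[Sequence[float]], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE on top-1 confidence with n_bins equal-width bins over [0, 1]."""
    # Per-bin running sums: sample count, summed confidence, summed correctness.
    counts: List[int] = [0] * n_bins
    conf_sums: List[float] = [0.0] * n_bins
    acc_sums: List[float] = [0.0] * n_bins
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=lambda k: p[k])  # top-1 class (first on ties)
        conf = p[pred]
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        counts[b] += 1
        conf_sums[b] += conf
        acc_sums[b] += 1.0 if pred == y else 0.0
    n = len(labels)
    # Weighted average of |accuracy - confidence| over the non-empty bins.
    return sum(
        (c / n) * abs(acc_sums[b] / c - conf_sums[b] / c)
        for b, c in enumerate(counts)
        if c > 0
    )


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]  # toy predictions
    labels = [0, 1, 1]
    print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```

Both measures are zero for a perfectly confident, correct classifier. The ECE depends on the binning choice, whereas the Brier score does not. This is one reason a bound on the ECE is harder to obtain than one on the Brier score.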
