Et Tu Certifications: Robustness Certificates Yield Better Adversarial Examples
Andrew C. Cullen, Shijie Liu, Paul Montague, Sarah M. Erfani, Benjamin I. P. Rubinstein
TL;DR
The paper demonstrates a counterintuitive risk: robustness certificates can be exploited to craft norm-minimising adversarial examples more efficiently, challenging the notion that published certifications universally improve security. It introduces the Certification Aware Attack (CAA), a two-stage framework that first uses certification radii to navigate the input space more effectively and then refines the adversarial perturbation while preserving the attack's target label. Empirical results on MNIST, CIFAR-10, and ImageNet show that CAA finds adversarial examples faster and with smaller perturbations than established attacks on models protected by randomized smoothing and IBP-based certifiers, with substantial reductions in the median attack size relative to certified bounds. The work emphasizes that releasing certifications can inadvertently increase the attack surface. It discusses mitigation strategies such as withholding certification details and relying on class-level disclosures, and it offers a framework for assessing the tightness of certification bounds in practice.
Abstract
In guaranteeing the absence of adversarial examples in an instance's neighbourhood, certification mechanisms play an important role in demonstrating neural net robustness. In this paper, we ask whether these certifications can compromise the very models they help to protect. Our new \emph{Certification Aware Attack} exploits certifications to produce computationally efficient norm-minimising adversarial examples $74\%$ more often than comparable attacks, while reducing the median perturbation norm by more than $10\%$. While these attacks can be used to assess the tightness of certification bounds, they also highlight that releasing certifications can paradoxically reduce security.
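To make the two-stage idea summarised above concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a binary linear classifier, for which the exact $L_2$ certified radius is simply the distance to the decision hyperplane. The function names (`predict`, `certified_radius`, `certification_guided_attack`) and the `overshoot` factor are hypothetical and chosen only for illustration. Stage 1 uses the certificate to choose the step size: because no adversarial example can lie inside the certified radius, each step moves slightly further than that radius. Stage 2 shrinks the resulting perturbation by bisection while keeping the example adversarial.

```python
import numpy as np

def predict(w, b, x):
    """Binary linear classifier: class 1 if w.x + b > 0, else class 0."""
    return int(w @ x + b > 0)

def certified_radius(w, b, x):
    """Exact L2 certified radius of a linear classifier: distance to the hyperplane."""
    return abs(w @ x + b) / np.linalg.norm(w)

def certification_guided_attack(w, b, x0, overshoot=1.05, max_steps=50, refine_steps=30):
    """Toy two-stage attack (illustrative assumption, not the paper's algorithm)."""
    label = predict(w, b, x0)
    unit_w = w / np.linalg.norm(w)
    # Move against the margin: towards the hyperplane from whichever side x0 is on.
    direction = -unit_w if label == 1 else unit_w

    # Stage 1: step by slightly more than the certified radius until the label flips.
    x = x0.copy()
    steps = 0
    while predict(w, b, x) == label:
        if steps == max_steps:
            return None  # no adversarial example found within the step budget
        x = x + overshoot * certified_radius(w, b, x) * direction
        steps += 1

    # Stage 2: bisect on the perturbation scale, keeping the example adversarial.
    delta = x - x0
    lo, hi = 0.0, 1.0  # lo: still the original label, hi: adversarial
    for _ in range(refine_steps):
        mid = 0.5 * (lo + hi)
        if predict(w, b, x0 + mid * delta) != label:
            hi = mid
        else:
            lo = mid
    return x0 + hi * delta

# Usage on a hypothetical 2-D classifier.
w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([3.0, 4.0])
x_adv = certification_guided_attack(w, b, x0)
print("certified radius:", certified_radius(w, b, x0))
print("attack norm:     ", np.linalg.norm(x_adv - x0))
```

In this linear toy case the certificate is tight, so the refined perturbation norm should land essentially on the certified radius. For certifiers such as randomized smoothing or IBP the bounds are generally looser, and the gap between the smallest attack found and the certified radius is what indicates how tight a certificate is.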
