Certified but Fooled! Breaking Certified Defences with Ghost Certificates

Quoc Viet Vo, Tashreque M. Haq, Paul Montague, Tamas Abraham, Ehsan Abbasnejad, Damith C. Ranasinghe

TL;DR

The paper addresses the vulnerability of probabilistic robustness certification by introducing GhostCert, a region-aware adversarial attack that causes misclassification while also obtaining a spoofed certificate with a large radius. GhostCert constrains perturbations to salient, semantically coherent regions identified via GradCAM/attention and SAM, and optimizes a certification-aware objective over smoothing samples with PGD under a budget $\epsilon$. It is evaluated on ImageNet against RS, RS Ensemble, and DensePure. GhostCert achieves higher attack success rates and larger spoofed radii than prior attacks such as Shadow Attack, with perturbations that are perceptually indistinguishable from the source, and in some cases it causes DoS/abstain outcomes; a user study confirms the perceptual naturalness of the adversarial images. The work demonstrates that certifiably robust defenses can be circumvented in practice and urges the development of stronger certification mechanisms that resist region-focused, semantics-preserving perturbations and avoid misleading robustness guarantees.
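To make the "perturbations constrained to salient regions under a budget $\epsilon$" idea concrete, here is a minimal sketch of a single mask-constrained PGD step. It is an illustration under stated assumptions, not GhostCert's implementation: the function name `masked_l2_pgd_step`, the use of an $\ell_2$ budget, the normalized-gradient step, and the $[0, 1]$ pixel range are all assumptions, and the gradient `grad` of whatever certification-aware loss is used is taken as given.

```python
import numpy as np

def masked_l2_pgd_step(x, delta, grad, mask, step_size, eps):
    """One descent step on an attack loss, restricted to a region mask.

    x, delta, grad, mask share one shape; mask is 1 inside the editable
    region and 0 elsewhere. All names here are hypothetical.
    """
    # Keep only the gradient inside the selected region.
    g = grad * mask
    # Normalize so step_size is expressed in L2 units.
    g = g / (np.linalg.norm(g) + 1e-12)
    # Descend on the loss and re-apply the mask.
    delta = (delta - step_size * g) * mask
    # Project back onto the L2 ball of radius eps.
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)
    # Keep x + delta a valid image in [0, 1] (assumed pixel range).
    return np.clip(x + delta, 0.0, 1.0) - x

x = np.full((4, 4), 0.5)
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0
delta = masked_l2_pgd_step(x, np.zeros((4, 4)), np.ones((4, 4)), mask, 1.0, 0.5)
```

Pixels outside the mask are never modified, so the distortion budget is spent only where the region proposal allows it.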

Abstract

Certified defenses promise provable robustness guarantees. We study the malicious exploitation of probabilistic certification frameworks to better understand the limits of guarantee provisions. Here, the objective is not only to mislead a classifier, but also to manipulate the certification process into generating a robustness guarantee for an adversarial input: certificate spoofing. A recent study in ICLR demonstrated that crafting large perturbations can shift inputs far into regions capable of generating a certificate for an incorrect class. Our study investigates whether the perturbations needed to cause a misclassification, and yet coax a certified model into issuing a deceptive, large robustness radius for a target class, can still be made small and imperceptible. We explore the idea of region-focused adversarial examples to craft imperceptible perturbations, spoof certificates, and achieve certification radii larger than those of the source class: ghost certificates. Extensive evaluations on ImageNet demonstrate the ability to effectively bypass state-of-the-art certified defenses such as DensePure. Our work underscores the need to better understand the limits of robustness certification methods.
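For readers unfamiliar with what a "certification radius" is, the sketch below shows the standard randomized smoothing certificate: given a lower confidence bound `p_lower` on the probability that the base classifier returns the top class under Gaussian noise of standard deviation `sigma`, the certified $\ell_2$ radius is $\sigma\,\Phi^{-1}(p_{\text{lower}})$, and the classifier abstains when the bound does not exceed one half. The function name and the choice to take `p_lower` as an input (rather than estimating it from samples) are assumptions made for illustration; this is not claimed to be the exact procedure used in the paper.

```python
from statistics import NormalDist

def certified_radius(p_lower, sigma):
    """Randomized smoothing L2 radius, or None to signal abstention.

    p_lower: lower confidence bound on the top-class probability under noise.
    sigma:   standard deviation of the Gaussian smoothing noise.
    """
    if not 0.0 <= p_lower < 1.0:
        raise ValueError("p_lower must lie in [0, 1)")
    if p_lower <= 0.5:
        return None  # no class wins by a certifiable margin: abstain
    # Inverse standard normal CDF, scaled by the noise level.
    return sigma * NormalDist().inv_cdf(p_lower)

radius = certified_radius(0.9, 0.5)
```

The certificate depends only on how consistently the noisy copies vote for one class, not on whether that class is correct, which is why an adversarial input that votes consistently for the wrong class can receive a large radius.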

Paper Structure

This paper contains 14 sections, 11 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of our attack formulation. For a given source image (half track) and certification radii, we show the corresponding adversarial examples created by our attack, GhostCert, and by Shadow Attack (Ghiasi et al., ICLR 2020) against three certified defense methods: Randomized Smoothing (with ResNet-50), Smoothed Ensemble, and DensePure (diffusion-based denoiser and Transformer under Randomized Smoothing). Shadow Attack fails to generate a spoofed certificate (✗) for Smoothed Ensemble even with a larger distortion. GhostCert generates more natural-looking adversarial examples across all three defenses while achieving misclassification with: i) higher spoofed certification radii; and ii) significantly lower $l_2$ norms ($\|\delta\|_2$) compared to Shadow Attack (Adversarial and Imperceptible). GhostCert results also surpass the certification radii of the source image (Strongly Certified); see the naturalism-score figure for results of a user study on imperceptibility. Code: https://github.com/ghostcert
  • Figure 2: A pictorial illustration of GhostCert. Starting from the source image $x$ with label Rhodesian ridgeback and given a target label Red fox, region proposals are evaluated to select regions for manipulation, considering the salient features important for classification decisions. The idea is to preserve semantics whilst minimising distortions. Crafting perturbations $\delta$ constrained to the salient regions then yields the adversarial example $x+\delta$, which is misclassified as a Red fox and strongly certified while remaining visually indistinguishable from the source image $x$.
  • Figure 3: Illustrative examples of successful attacks by GhostCert. For each case, we display the adversarial image and its corresponding perturbation generated by both Shadow Attack and our method, GhostCert. GhostCert produces strongly certified adversarial examples whose perturbations are less perceptible than those from Shadow Attack, while also achieving higher spoofed certification radii at lower $l_2$ norms ($\|\delta\|_2$).
  • Figure 4: Attack success rate (ASR) and spoofed radii versus distortion budget $\|\delta\|_2$ for three attacks in the untargeted setting against (a) a single ResNet-50 under Randomized Smoothing (RS) and (b) an ensemble of three consistency ResNet-50 models under RS.
  • Figure 5: Attack success rate (ASR) and spoofed radii versus distortion budget $\|\delta\|_2$ for three attacks in the targeted setting against (a) a single ResNet-50 under Randomized Smoothing (RS) and (b) an ensemble of three consistency ResNet-50 models under RS.
  • ...and 7 more figures