PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks

Chen Feng, Ziquan Liu, Zhuo Zhi, Ilija Bogunovic, Carsten Gerner-Beuerle, Miguel Rodrigues

TL;DR

PROSAC introduces a provably safe certification framework for machine learning models under adversarial attacks, delivering population-level guarantees via a calibration-set–based hypothesis test for $(\alpha,\zeta)$-safety. It combines a max-adversarial-risk formulation with a finite-sample $p$-value and a Bayesian optimization procedure (GP-UCB) to systematically search attacker configurations while preserving statistical guarantees. The approach is demonstrated on Vision Transformers and ResNets under multiple attack types, revealing that ViTs generally achieve higher robustness and that larger models tend to be more robust; adversarial training further improves safety at some cost to accuracy. By providing formal, regulatory-friendly guarantees beyond empirical risk, PROSAC offers a principled tool for certifying robustness of black-box ML systems against adversarial perturbations; future work includes extending to multimodal settings and addressing remaining gaps in error-control properties.

Abstract

It is widely known that state-of-the-art machine learning models, including vision and language models, can be seriously compromised by adversarial perturbations. It is therefore increasingly relevant to develop capabilities to certify their performance in the presence of the most effective adversarial attacks. Our paper offers a new approach to certify the performance of machine learning models in the presence of adversarial attacks with population-level risk guarantees. In particular, we introduce the notion of an $(\alpha,\zeta)$-safe machine learning model. We propose a hypothesis testing procedure, based on the availability of a calibration set, to derive statistical guarantees ensuring that the probability of declaring that the adversarial (population) risk of a machine learning model is less than $\alpha$ (i.e., the model is safe), while the model is in fact unsafe (i.e., the model's adversarial population risk is higher than $\alpha$), is less than $\zeta$. We also propose Bayesian optimization algorithms to determine efficiently whether a machine learning model is $(\alpha,\zeta)$-safe in the presence of an adversarial attack, along with statistical guarantees. We apply our framework to a range of machine learning models, including various sizes of Vision Transformer (ViT) and ResNet models, impaired by a variety of adversarial attacks, such as PGDAttack, MomentumAttack, GenAttack and BanditAttack, to illustrate the operation of our approach. Importantly, we show that ViTs are generally more robust to adversarial attacks than ResNets, and large models are generally more robust than smaller models. Our approach goes beyond existing empirical adversarial risk-based certification guarantees. It formulates rigorous (and provable) performance guarantees that can be used to satisfy regulatory requirements mandating the use of state-of-the-art technical tools.
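
Read as a formula, the guarantee described in the abstract can be written roughly as follows. This is a paraphrase of the prose, not the paper's Definition 1 verbatim; the symbols $\mathcal{R}_{\lambda}$ (adversarial population risk under attack configuration $\lambda$) and $\mathcal{R}^*$ (the worst case over the configuration set $\Lambda$) follow the notation of Proposition 1, and the use of a supremum is an assumption.

$$
\mathcal{R}^* = \sup_{\lambda\in\Lambda} \mathcal{R}_{\lambda},
\qquad
\mathbb{P}\big(\text{model declared safe}\big) \leq \zeta
\quad \text{whenever} \quad \mathcal{R}^* > \alpha .
$$

The probability is taken over the random draw of the calibration set, so $\zeta$ bounds how often an unsafe model would be wrongly certified.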

Paper Structure

This paper contains 20 sections, 4 theorems, 13 equations, 1 figure, 1 algorithm.

Key Result

Proposition 1

Let $p^*$ be a $p$-value associated with the hypothesis testing problem where the null hypothesis is $\mathcal{H}_0 : \mathcal{R}^* > \alpha$ or, equivalently, $\mathcal{H}_0 : \exists~\lambda\in\Lambda, \mathcal{R}_{\lambda} > \alpha$. It follows immediately that the machine learning model is $(\alpha,\zeta)$-safe provided that the null hypothesis is rejected if and only if $p^* \leq \zeta$.
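
To make the decision rule in Proposition 1 concrete, the sketch below shows one way such a test could look. It is not the paper's construction: it assumes a loss bounded in $[0,1]$, a Hoeffding-based $p$-value for each attack configuration, and a small finite set of configurations over which the maximum $p$-value is taken (the paper instead searches configurations with GP-UCB). The function names, the configuration labels, and all numbers are hypothetical.

```python
import math


def hoeffding_p_value(empirical_risk, n, alpha):
    """Hoeffding p-value for H0: R > alpha, assuming losses in [0, 1].

    empirical_risk is the mean adversarial loss on n i.i.d. calibration
    points. The value is 1 when the empirical risk is at or above alpha.
    """
    gap = max(alpha - empirical_risk, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def max_p_value(risks_by_config, n, alpha):
    """p* for the composite null 'some configuration has risk > alpha'.

    Taking the maximum over a finite set of configurations keeps the
    p-value valid: under H0 at least one configuration is truly unsafe,
    and the maximum is never smaller than that configuration's p-value.
    """
    return max(hoeffding_p_value(r, n, alpha) for r in risks_by_config.values())


def is_certified_safe(risks_by_config, n, alpha, zeta):
    """Reject H0 (declare the model safe) if and only if p* <= zeta."""
    return max_p_value(risks_by_config, n, alpha) <= zeta


# Hypothetical empirical adversarial risks on a calibration set of 1000 points.
calibration_risks = {"config_a": 0.02, "config_b": 0.04}
decision = is_certified_safe(calibration_risks, n=1000, alpha=0.10, zeta=0.05)
print("certified safe:", decision)
```

Because the worst configuration drives the decision, a single setting whose empirical risk sits close to $\alpha$ is enough to block certification, which matches the max-adversarial-risk reading of the null hypothesis.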

Figures (1)

  • Figure 1: Model certification w.r.t. different attack budgets $\epsilon$.

Theorems & Definitions (5)

  • Definition 1
  • Proposition 1
  • Theorem 2
  • Theorem 3
  • Theorem 4: $(\alpha,\zeta)$-Safe Model with GP-UCB