Certified Robustness against Sparse Adversarial Perturbations via Data Localization
Ambar Pal, René Vidal, Jeremias Sulam
TL;DR
This work links the existence of classifiers that are robust to sparse ($\ell_0$) adversarial perturbations to a localization property of the data distribution: if a robust classifier exists, the class-conditional distributions must concentrate on small-volume regions, and conversely, strong localization with separation guarantees suffices to construct a robust classifier. Building on this theory, the authors introduce Box-NN, a nearest-box classifier whose decision regions are defined by axis-aligned boxes, and derive a certified $\ell_0$ robustness guarantee from a margin based on distances to boxes. They develop an optimization framework to learn the boxes from data, using soft-min relaxations and initialization tricks to optimize a robustness-aware objective. Empirical results on MNIST and Fashion-MNIST show that Box-NN achieves state-of-the-art certified robustness against sparse attacks, often outperforming existing ensembling and randomized-smoothing baselines across a broad range of perturbation budgets. The work suggests that exploiting data geometry through localized, box-shaped decision regions can yield lighter and more effective certifiable defenses against $\ell_0$ perturbations. The noted limitations are scalability to more complex datasets and the restricted shape of the decision boundaries, which future work could enrich.
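To make the nearest-box idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the $\ell_0$ distance from a point to an axis-aligned box is the number of coordinates lying outside the box, and it certifies a radius with a generic margin argument: changing $k$ coordinates moves each point-to-box distance by at most $k$, so the prediction cannot change while $2k < d_{\text{other}} - d_{\text{own}}$. The box layout, the function names, and the toy data are all hypothetical, and the exact certificate in the paper may be stated differently.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout for this sketch: (lower bounds, upper bounds, class label).
Box = Tuple[Sequence[float], Sequence[float], int]


def l0_dist_to_box(x: Sequence[float],
                   lower: Sequence[float],
                   upper: Sequence[float]) -> int:
    """Count the coordinates of x that fall outside [lower_i, upper_i]."""
    return sum(1 for xi, lo, hi in zip(x, lower, upper) if xi < lo or xi > hi)


def box_nn_predict(x: Sequence[float], boxes: List[Box]) -> Tuple[int, int]:
    """Return (predicted label, certified l0 radius) for a nearest-box rule.

    The radius is the largest integer k with 2k < d_other - d_own, where
    d_own is the distance to the nearest box overall and d_other is the
    distance to the nearest box carrying a different label.
    """
    dists = [(l0_dist_to_box(x, lo, hi), label) for lo, hi, label in boxes]
    d_own, label = min(dists)  # ties are broken by the smaller label
    others = [d for d, lab in dists if lab != label]
    if not others:
        # Only one class is present, so no perturbation can change the label.
        return label, len(x)
    d_other = min(others)
    radius = max((d_other - d_own - 1) // 2, 0)
    return label, radius


# Toy example with two boxes in five dimensions (made-up numbers).
boxes: List[Box] = [
    ([0.0, 0.0, 0.0, 0.0, 0.0], [0.2, 1.0, 0.6, 0.2, 1.0], 0),
    ([0.5, 0.0, 0.8, 0.5, 0.0], [1.0, 0.3, 1.0, 1.0, 0.3], 1),
]
x = [0.0, 0.9, 0.5, 0.1, 0.8]
print(box_nn_predict(x, boxes))  # prints (0, 2)
```

In the toy example, `x` lies inside the class-0 box (distance 0) and outside the class-1 box in all five coordinates (distance 5), so any change to at most two coordinates leaves the class-0 box strictly closer.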
Abstract
Recent work in adversarial robustness suggests that natural data distributions are localized, i.e., they place high probability in small volume regions of the input space, and that this property can be utilized for designing classifiers with improved robustness guarantees for $\ell_2$-bounded perturbations. Yet, it is still unclear if this observation holds true for more general metrics. In this work, we extend this theory to $\ell_0$-bounded adversarial perturbations, where the attacker can modify a few pixels of the image but is unrestricted in the magnitude of perturbation, and we show necessary and sufficient conditions for the existence of $\ell_0$-robust classifiers. Theoretical certification approaches in this regime essentially employ voting over a large ensemble of classifiers. Such procedures are combinatorial and expensive or require complicated certification techniques. In contrast, a simple classifier emerges from our theory, dubbed Box-NN, which naturally incorporates the geometry of the problem and improves upon the current state-of-the-art in certified robustness against sparse attacks for the MNIST and Fashion-MNIST datasets.
