Characterizing Data Point Vulnerability via Average-Case Robustness
Tessa Han, Suraj Srinivas, Himabindu Lakkaraju
TL;DR
The paper introduces average-case robustness, defined as $p^\text{robust}_\sigma(\mathbf{x}) = P_{\epsilon \sim \mathcal{N}(0,\sigma^2)}[ \arg\max_i f_i(\mathbf{x}+\epsilon) = t ]$, as a complement to adversarial robustness for understanding data-point vulnerability. It develops analytical estimators for multi-class classifiers: the Taylor estimator $p^\text{taylor}_\sigma$ and the MMSE estimator $p^\text{mmse}_\sigma$, both of which evaluate a multivariate Gaussian CDF with covariance $\mathbf{U}\mathbf{U}^\top$, plus faster mv-sigmoid and softmax-based approximations that improve efficiency and differentiability. The authors provide finite-sample error bounds, show that the estimators converge quickly (often with small $N$) and are substantially cheaper to compute than Monte Carlo sampling, and validate the approach on MNIST, FMNIST, and CIFAR-10/100 with several architectures. The estimators track local robustness accurately and can be used to identify vulnerable samples and class-level robustness bias. The authors also show that softmax probabilities are in general a poor proxy for average-case robustness, whereas $p^\text{robust}_\sigma$ separates canonical from boundary data points and aids dataset debugging. Collectively, the work offers practical, scalable tools for diagnosing data-point vulnerability and model robustness beyond worst-case guarantees.
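To make the definition concrete, the following is a minimal sketch of the simplest special case rather than the paper's estimators: a two-class linear model. Here a first-order Taylor expansion is exact and the multivariate Gaussian CDF reduces to a univariate standard normal CDF evaluated at the logit margin divided by $\sigma$ times the norm of the weight difference. The sketch assumes that $t$ is the class predicted at $\mathbf{x}$ and that the noise is isotropic with covariance $\sigma^2 I$. The function names and the toy weights are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(z):
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_class_linear_robustness(W, b, x, sigma):
    """Closed-form P[argmax f(x + eps) = t] for f(x) = W x + b with two classes.

    Assumes eps ~ N(0, sigma^2 I) and t = predicted class at x.
    """
    logits = W @ x + b
    t = int(np.argmax(logits))       # predicted class at x
    j = 1 - t                        # the only competing class
    margin = logits[t] - logits[j]   # f_t(x) - f_j(x)
    grad_gap = W[t] - W[j]           # gradient of the margin w.r.t. x
    return std_normal_cdf(margin / (sigma * np.linalg.norm(grad_gap)))

# Hypothetical toy model: 3 input features, 2 classes.
W = np.array([[2.0, -1.0, 0.5],
              [0.0, 1.0, -0.5]])
b = np.array([0.1, -0.2])
x = np.array([0.3, 0.2, -0.1])
sigma = 0.4

p_closed = two_class_linear_robustness(W, b, x, sigma)

# Monte Carlo cross-check of the same probability.
rng = np.random.default_rng(0)
noisy = x + rng.normal(0.0, sigma, size=(200_000, 3))
t = int(np.argmax(W @ x + b))
p_mc = float(np.mean(np.argmax(noisy @ W.T + b, axis=1) == t))

print(p_closed, p_mc)
```

The closed form costs one margin and one norm, while the sampling check needs many forward passes to reach similar precision. That difference is the efficiency gap the analytical estimators aim to exploit for nonlinear, multi-class models.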
Abstract
Studying the robustness of machine learning models is important to ensure consistent model behaviour across real-world settings. To this end, adversarial robustness is a standard framework, which views robustness of predictions through a binary lens: either a worst-case adversarial misclassification exists in the local region around an input, or it does not. However, this binary perspective does not account for degrees of vulnerability: data points with a larger number of misclassified examples in their neighborhoods are more vulnerable. In this work, we consider a complementary framework for robustness, called average-case robustness, which measures the fraction of points in a local region that yield consistent predictions. Computing this quantity is hard, as standard Monte Carlo approaches are inefficient, especially for high-dimensional inputs. We propose the first analytical estimators for average-case robustness for multi-class classifiers. We show empirically that our estimators are accurate and efficient for standard deep learning models and demonstrate their usefulness for identifying vulnerable data points, as well as for quantifying the robustness bias of models. Overall, our tools provide a complementary view of robustness, improving our ability to characterize model behaviour.
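The Monte Carlo baseline mentioned in the abstract can be sketched as follows: sample Gaussian perturbations around an input and report the fraction whose prediction matches the prediction at the unperturbed input. This is an illustrative sketch, not the authors' code. It assumes isotropic noise with standard deviation `sigma` and compares against the class predicted at `x`. The helper `mc_average_case_robustness` and the toy linear classifier are hypothetical.

```python
import numpy as np

def mc_average_case_robustness(predict_logits, x, sigma, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P[argmax f(x + eps) = argmax f(x)], eps ~ N(0, sigma^2 I).

    predict_logits maps an array of shape (n, d) to logits of shape (n, num_classes).
    """
    rng = np.random.default_rng(seed)
    t = int(np.argmax(predict_logits(x[None, :])[0]))   # predicted class at x
    noise = rng.normal(0.0, sigma, size=(n_samples, x.shape[0]))
    preds = np.argmax(predict_logits(x[None, :] + noise), axis=1)
    return float(np.mean(preds == t))                   # fraction of consistent predictions

# Hypothetical toy classifier: a fixed linear map from 2 inputs to 3 logits.
W = np.array([[1.0, 0.0],
              [-1.0, 0.0],
              [0.0, 1.0]])
b = np.zeros(3)

def linear_logits(X):
    return X @ W.T + b

x_far = np.array([3.0, 0.0])    # far from every decision boundary
x_near = np.array([0.1, 0.0])   # close to the decision boundaries

p_far = mc_average_case_robustness(linear_logits, x_far, sigma=0.5)
p_near = mc_average_case_robustness(linear_logits, x_near, sigma=0.5)
print(p_far, p_near)
```

Both points are classified without any adversarial analysis, yet the estimate for `x_near` is much lower than for `x_far`. This is the graded notion of vulnerability that a binary worst-case criterion does not capture. The cost is one forward pass per sample, which motivates the analytical estimators.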
