MeanSparse: Post-Training Robustness Enhancement Through Mean-Centered Feature Sparsification
Sajjad Amini, Mohammadreza Teymoorianfard, Shiqing Ma, Amir Houmansadr
TL;DR
MeanSparse is a post-training, mean-centered feature sparsification that blocks non-robust variations of feature activations near the channel mean. It computes per-channel means and variances and applies a threshold $Th=\alpha \sigma_{ch}$: values within the threshold of the channel mean are replaced by that mean, which removes a region attackers can exploit while preserving information outside the blocked region. The approach yields state-of-the-art AutoAttack accuracy on RobustBench models across CIFAR-10, CIFAR-100, and ImageNet, with notable gains under $\ell_\infty$ and $\ell_2$ threats and compatibility with both PGD and TRADES adversarial training. It is lightweight to implement, requires only statistics gathered from training data, and retains its robustness gains under adaptive attacks and in black-box settings, albeit with limitations in non-adversarial training scenarios. Overall, MeanSparse offers a practical, scalable robustness enhancement that can be integrated post hoc with minimal utility loss.
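The thresholding rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `channel_stats` and `mean_sparse`, the use of one scalar per channel (rather than full feature maps), the population standard deviation, and the inclusive comparison at the threshold are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def channel_stats(samples):
    """Estimate per-channel mean and standard deviation.

    samples: list of activation vectors, each holding one value per channel
    (a simplification; real feature maps also have spatial dimensions).
    """
    channels = list(zip(*samples))  # regroup values by channel
    means = [fmean(c) for c in channels]
    stds = [pstdev(c) for c in channels]  # population std, an assumption
    return means, stds


def mean_sparse(x, means, stds, alpha):
    """Replace values within Th = alpha * sigma_ch of the channel mean by the mean."""
    out = []
    for v, mu, sigma in zip(x, means, stds):
        th = alpha * sigma
        # Inside the blocked region: output the mean; outside: pass through.
        out.append(mu if abs(v - mu) <= th else v)
    return out


# Hypothetical probe activations: three samples, two channels.
probe = [[0.0, 1.0], [2.0, 1.0], [4.0, 7.0]]
means, stds = channel_stats(probe)

# Channel 0 is close to its mean and is blocked; channel 1 is far and is kept.
print(mean_sparse([2.5, 9.0], means, stds, alpha=0.5))  # prints [2.0, 9.0]
```

With `alpha=0` the blocked region has zero width and activations that differ from the mean pass through unchanged; larger values of `alpha` widen the blocked region.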
Abstract
We present a simple yet effective method to improve the robustness of both convolutional and attention-based neural networks against adversarial examples by post-processing an adversarially trained model. Our technique, MeanSparse, cascades the activation functions of a trained model with novel operators that sparsify mean-centered feature vectors. This is equivalent to reducing feature variations around the mean, and we show that such reduced variations barely affect the model's utility, yet they strongly attenuate adversarial perturbations and decrease the attacker's success rate. Our experiments show that, when applied to the top models in the RobustBench leaderboard, MeanSparse achieves a new robustness record of 75.28% (from 73.71%), 44.78% (from 42.67%) and 62.12% (from 59.56%) on CIFAR-10, CIFAR-100 and ImageNet, respectively, in terms of AutoAttack accuracy. Code is available at https://github.com/SPIN-UMass/MeanSparse
