MeanSparse: Post-Training Robustness Enhancement Through Mean-Centered Feature Sparsification

Sajjad Amini; Mohammadreza Teymoorianfard; Shiqing Ma; Amir Houmansadr

TL;DR

MeanSparse introduces a post-training mean-centered feature sparsification that blocks non-robust, near-mean variations in feature activations. By computing per-channel means and variances and applying a threshold $Th=\alpha \sigma_{ch}$ to replace values near the mean with the channel mean, the method reduces attacker exploitability while preserving information outside the blocked region. The approach yields state-of-the-art AutoAttack accuracy on RobustBench models across CIFAR-10, CIFAR-100, and ImageNet, with notable gains under $\ell_\infty$ and $\ell_2$ threats and compatibility with both PGD and TRADES adversarial training. It remains lightweight to implement, requires only probe statistics from training data, and demonstrates robustness gains under adaptive attacks and black-box settings, albeit with limitations in non-adversarial training scenarios. Overall, MeanSparse offers a practical, scalable enhancement to robustness that can be integrated post hoc with minimal utility loss.
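The thresholding rule described above can be illustrated with a minimal single-channel sketch. It is not the authors' implementation (their code is in the linked repository); the function names `channel_stats` and `mean_sparse`, the sample activations, the use of the population standard deviation, and the inclusive boundary (`<=`) are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def channel_stats(channel_values):
    """Mean and population standard deviation of one channel's activations.

    Assumption: statistics are gathered once over training data, post-training.
    """
    return fmean(channel_values), pstdev(channel_values)


def mean_sparse(channel_values, mu, sigma, alpha):
    """Replace activations within alpha * sigma of mu by mu; keep the rest."""
    threshold = alpha * sigma  # Th = alpha * sigma_ch
    return [mu if abs(v - mu) <= threshold else v for v in channel_values]


# Hypothetical activations of one channel collected over a training set.
train_activations = [0.0, 1.0, 2.0, 3.0, 10.0]
mu, sigma = channel_stats(train_activations)

# Hypothetical test-time activations of the same channel.
test_activations = [2.5, 3.0, 9.0, -1.0]
sparsified = mean_sparse(test_activations, mu, sigma, alpha=0.5)
```

In this sketch, values close to the channel mean collapse onto the mean, while values outside the blocked band pass through unchanged, so setting `alpha` to zero leaves the features untouched.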

Abstract

We present a simple yet effective method to improve the robustness of both convolutional and attention-based neural networks against adversarial examples by post-processing an adversarially trained model. Our technique, MeanSparse, cascades the activation functions of a trained model with novel operators that sparsify mean-centered feature vectors. This is equivalent to reducing feature variations around the mean, and we show that such reduced variations only marginally affect the model's utility, yet they strongly attenuate adversarial perturbations and decrease the attacker's success rate. Our experiments show that, when applied to the top models on the RobustBench leaderboard, MeanSparse achieves new robustness records of 75.28% (from 73.71%), 44.78% (from 42.67%), and 62.12% (from 59.56%) on CIFAR-10, CIFAR-100, and ImageNet, respectively, in terms of AutoAttack accuracy. Code is available at https://github.com/SPIN-UMass/MeanSparse

Paper Structure (28 sections, 10 equations, 3 figures, 13 tables)

Introduction
Preliminaries
Notations
Related Work
Threat Model
Methodology
Intuition from Regularized Optimization Objective
Mean-centered Feature Sparsification
MeanSparse Design
Sparsification in Post-processing
Adaptive Sparsification Using Feature Standard Deviation
Per-channel Sparsification
Complete Pipeline
Experiments
Evaluation Metrics
...and 13 more sections

Figures (3)

Figure 1: Mean-based sparsification operator used in the MeanSparse technique for a hypothetical channel ch. The first column shows the design procedure: the mean ($\mu_\mathrm{ch}$) and standard deviation ($\sigma_\mathrm{ch}$) are first calculated over the training set (top figure), and the mean-based sparsification operator is then designed with hyper-parameter $\alpha$, which blocks variations within the $\alpha\sigma_\mathrm{ch}$ vicinity of $\mu_\mathrm{ch}$ (bottom figure). The second column shows how mean-based sparsification acts on the input features of one test sample (top figure) to generate the output features (bottom figure). The third column shows the effect of mean-based sparsification on the feature histogram.
Figure 2: Mean-centered feature used in the regularized optimization problem of \ref{l0problem}
Figure 3: Performance of the original models and of the same models after integrating the MeanSparse technique. For the CIFAR-10 dataset under the $\ell_\infty$ attack, the models are WideResNet-94-16 (Rank 1) arlbd24, RaWideResNet-70-16 (Rank 4) rpapx23, and WideResNet-70-16 (Rank 5) bdmwp23; under the $\ell_2$ attack, the model is WideResNet-70-16 (Rank 1) bdmwp23. For the ImageNet dataset under the $\ell_\infty$ attack, the models are Swin-L (Rank 2) acsld24, ConvNeXt-L (Rank 4) acsld23, and RaWideResNet (Rank 12) rpapx23. For the CIFAR-100 dataset under the $\ell_\infty$ attack, the model is WideResNet-70-16 (Rank 1) bdmwp23. All rankings are based on RobustBench rasca20.

Theorems & Definitions (1)

Definition 1: Proximal operator papb14
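
For reference, the proximal operator as it is commonly stated in the optimization literature is sketched below. The symbols $f$, $\lambda$, $\mathbf{v}$, and $\mathbf{x}$ are generic and are an assumption here; the paper's own notation and scaling may differ. The second line gives the well-known closed form for $f = \|\cdot\|_0$, which is element-wise hard thresholding and is presumably the link to the $\ell_0$-regularized objective named in the outline and in Figure 2.

```latex
\begin{align}
\operatorname{prox}_{\lambda f}(\mathbf{v})
  &= \arg\min_{\mathbf{x}} \left( f(\mathbf{x})
     + \frac{1}{2\lambda} \lVert \mathbf{x} - \mathbf{v} \rVert_2^2 \right), \\
\left[ \operatorname{prox}_{\lambda \lVert \cdot \rVert_0}(\mathbf{v}) \right]_i
  &= \begin{cases}
       v_i, & \lvert v_i \rvert > \sqrt{2\lambda}, \\
       0,   & \text{otherwise.}
     \end{cases}
\end{align}
```

Applied to a mean-centered feature $\mathbf{v} = \mathbf{a} - \mu_\mathrm{ch}$, hard thresholding zeroes the entries near the mean and keeps the rest, which is consistent with the blocking behavior described in the Figure 1 caption.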
