MiniFool -- Physics-Constraint-Aware Minimizer-Based Adversarial Attacks in Deep Neural Networks
Lucie Flek, Oliver Janik, Philipp Alexander Jung, Akbar Karimi, Timo Saala, Alexander Schmidt, Matthias Schott, Philipp Soldin, Matthias Thiesmeyer, Christopher Wiebusch, Ulrich Willemsen
TL;DR
MiniFool is a physics-constraint-aware adversarial attack that perturbs inputs within their experimental uncertainties to flip neural-network classifications in physics analyses. It minimizes a combined cost λ = α η + β (f_{i^*}(x^{a}; θ) − g)^2, where η = (1/N) Σ_i ((x_i^0 − x_i^{a})/σ_i)^2 is a χ^2-based test statistic comparing the original input x^0 with the perturbed input x^{a}, g is the desired target score, and the uncertainties σ_i are scaled by an attack parameter s. This enables a per-event robustness assessment. The method is demonstrated on MNIST, IceCube's ν_τ search, and CMS Open Data b-jet tagging. These studies show that robustness can be quantified by scanning s and that misclassified events are generally more susceptible to perturbations than correctly classified ones. The work provides an open-source implementation and proposes physics-informed adversarial testing as a tool to verify and strengthen neural-network analyses in physics, with future directions including richer uncertainty models and applications beyond classification.
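As an illustration of the combined cost above, the following minimal sketch evaluates λ for a single event. It is not the authors' implementation: the function name `minifool_cost`, its argument names, and the choice to apply the attack parameter as a multiplicative factor s·σ_i are assumptions made for this example, and the network score f_{i^*}(x^{a}; θ) is passed in as a plain number rather than computed by a model.

```python
import numpy as np

def minifool_cost(x0, xa, sigma, score, target, alpha=1.0, beta=1.0, s=1.0):
    """Evaluate lambda = alpha * eta + beta * (score - target)^2 for one event.

    x0, xa : original and perturbed inputs (same shape)
    sigma  : per-feature experimental uncertainties
    score  : network score for the attacked class at xa (hypothetical input)
    target : desired target score g
    s      : attack parameter; assumed here to multiply sigma
    """
    x0 = np.asarray(x0, dtype=float)
    xa = np.asarray(xa, dtype=float)
    scaled_sigma = s * np.asarray(sigma, dtype=float)
    # chi^2-like test statistic, averaged over the N input features
    eta = np.mean(((x0 - xa) / scaled_sigma) ** 2)
    return alpha * eta + beta * (score - target) ** 2

# A perturbation of one (unscaled) sigma per feature, score still far from target.
cost_s1 = minifool_cost([0.0, 0.0], [1.0, 2.0], [1.0, 2.0], score=0.25, target=1.0)
# Doubling s makes the same perturbation four times cheaper in eta.
cost_s2 = minifool_cost([0.0, 0.0], [1.0, 2.0], [1.0, 2.0], score=0.25, target=1.0, s=2.0)
```

Under this assumption, a larger s lowers the penalty for a given perturbation, so the minimizer can move the input further toward the target score.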
Abstract
In this paper, we present a new algorithm, MiniFool, that implements physics-inspired adversarial attacks for testing neural network-based classification tasks in particle and astroparticle physics. While we initially developed the algorithm for the search for astrophysical tau neutrinos with the IceCube Neutrino Observatory, we also apply it to data from other science domains, thus demonstrating its general applicability. Specifically, we apply the algorithm to the well-known MNIST data set and to Open Data from the CMS experiment at the Large Hadron Collider. The algorithm is based on minimizing a cost function that combines a $χ^2$-based test statistic with the deviation from the desired target score. The test statistic quantifies the probability of the perturbations applied to the data based on the experimental uncertainties. For the use cases studied, we find that the likelihood of a flipped classification differs between initially correctly and initially incorrectly classified events. By testing how the classifications change as a function of an attack parameter that scales the experimental uncertainties, the robustness of the network decision can be quantified. Furthermore, this allows testing the robustness of the classification of unlabeled experimental data.
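The scan over the attack parameter described above can be sketched on a toy problem. Everything in this example is hypothetical and chosen for illustration only: the one-feature sigmoid classifier `toy_score`, the event values, the uncertainty of 1.0, and the brute-force grid search that stands in for the minimizer. It assumes the attack parameter s multiplies the uncertainty and that the target score is that of the opposite class.

```python
import numpy as np

# Candidate perturbed inputs for the brute-force search (stand-in for a real minimizer).
GRID = np.linspace(-10.0, 10.0, 2001)

def toy_score(x):
    """Hypothetical one-feature classifier: class 1 if the sigmoid score exceeds 0.5."""
    return 1.0 / (1.0 + np.exp(-x))

def attack_flips(x0, sigma, s, alpha=1.0, beta=1.0):
    """Return True if the cost-minimizing perturbed input changes the predicted class."""
    original_class = toy_score(x0) > 0.5
    target = 0.0 if original_class else 1.0        # aim for the opposite class
    eta = ((x0 - GRID) / (s * sigma)) ** 2         # chi^2-like penalty (N = 1 feature)
    cost = alpha * eta + beta * (toy_score(GRID) - target) ** 2
    x_adv = GRID[np.argmin(cost)]
    return bool((toy_score(x_adv) > 0.5) != original_class)

def s_scan(events, sigma, s_values):
    """Fraction of events whose classification flips, for each attack parameter s."""
    return [float(np.mean([attack_flips(x0, sigma, s) for x0 in events]))
            for s in s_values]

events = np.array([-2.0, -0.2, 0.3, 1.5])          # made-up inputs, not real data
s_values = [0.01, 0.1, 1.0, 10.0, 100.0]
flip_fractions = s_scan(events, sigma=1.0, s_values=s_values)
```

In this toy setup no event flips at the smallest s, because any class-changing perturbation is many scaled standard deviations away, while every event flips at the largest s. The value of s at which a given event first flips is one way to read off a per-event robustness measure, and since no labels enter the scan, the same procedure applies to unlabeled data.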
