Algebraic Adversarial Attacks on Explainability Models

Lachlan Simpson, Federico Costanza, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew

TL;DR

This work identifies a fundamental vulnerability of post-hoc explainability models by introducing algebraic adversarial attacks grounded in geometric deep learning. By exploiting Lie-group symmetries of neural networks, the authors show how adversarial points $\tilde{x} = g \cdot x$ can be generated without optimization, with $F(g \cdot x) = F(x)$ and a controllable perturbation size. They formalize attacks for path-based methods, neural conductance, and SmoothGrad/LIME, derive invariance properties, and prove bounds on explanation deviation tied to the perturbation tolerance. Empirical evaluation on MNIST, Wisconsin Breast Cancer, and mobile-network traffic demonstrates that explanations can be systematically manipulated while predictions remain intact, underscoring practical implications for safety-critical deployments and prompting future exploration of broader symmetry groups and threshold settings.

Abstract

Classical adversarial attacks are phrased as a constrained optimisation problem. Despite the efficacy of a constrained optimisation approach to adversarial attacks, one cannot trace how an adversarial point was generated. In this work, we propose an algebraic approach to adversarial attacks and study the conditions under which one can generate adversarial examples for post-hoc explainability models. Phrasing neural networks in the framework of geometric deep learning, algebraic adversarial attacks are constructed through analysis of the symmetry groups of neural networks. Algebraic adversarial examples provide a mathematically tractable approach to adversarial examples. We validate our approach of algebraic adversarial examples on two well-known and one real-world dataset.
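
The idea of building an adversarial point from a symmetry of the network, rather than from an optimisation loop, can be sketched on a toy model. The example below is a minimal illustration under stated assumptions, not the authors' implementation: the two-layer ReLU network `F`, the helper names `expm_taylor` and `kernel_basis`, and the choice of group (matrices $g$ with $W_1 g = W_1$, generated by Lie algebra elements $\mathbf{A}$ with $W_1 \mathbf{A} = 0$) are all hypothetical choices made here for illustration.

```python
import numpy as np

def expm_taylor(M, terms=40):
    """Matrix exponential via a truncated power series (adequate for small ||M||)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result

def kernel_basis(W, tol=1e-10):
    """Orthonormal basis (as columns) of the null space of W, read off the SVD."""
    _, s, vt = np.linalg.svd(W)
    rank = int(np.sum(s > tol))
    return vt[rank:].T

rng = np.random.default_rng(0)
n, h = 6, 3  # input dimension and hidden width; h < n so ker(W1) is non-trivial
W1, b1 = rng.normal(size=(h, n)), rng.normal(size=h)
W2, b2 = rng.normal(size=(2, h)), rng.normal(size=2)

def F(x):
    """Toy two-layer ReLU network standing in for the classifier (an assumption)."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# Lie algebra element whose columns lie in ker(W1), so that W1 @ A = 0.
N = kernel_basis(W1)                       # shape (n, n - h)
A = N @ rng.normal(size=(N.shape[1], n))   # shape (n, n)

t = 0.1
g = expm_taylor(t * A)   # group element g = exp(tA); W1 @ g equals W1
x = rng.normal(size=n)
x_adv = g @ x            # adversarial point obtained by a group action, no optimisation

output_gap = np.max(np.abs(F(x_adv) - F(x)))
perturbation = np.linalg.norm(x_adv - x)
print("max output difference:", output_gap)
print("perturbation size:", perturbation)
```

Because $W_1 \mathbf{A} = 0$ implies $W_1 \exp(t\mathbf{A}) = W_1$, the first layer sees the same pre-activations for $x$ and $g \cdot x$, so the output is unchanged up to floating-point error while the input itself moves. The parameter `t` scales the size of that move.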

Paper Structure

This paper contains 21 sections, 16 theorems, 85 equations, 3 figures, 1 table, 2 algorithms.

Key Result

Proposition 3.4

Let $G$ be a matrix Lie group acting on $\mathbb{R}^{n}$, let $\mathfrak{g}$ be its Lie algebra and let $x \in \mathbb{R}^{n}$. Then for an arbitrary $\varepsilon \geq 0$ there exists $g \in G$, $g \neq \mathrm{Id}$, such that $\|g \cdot x - x\| \leq \varepsilon$. Moreover, $g = \exp(t\mathbf{A})$ with $\mathbf{A} \in \mathfrak{g}$ satisfies this inequality for all sufficiently small $|t|$.
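
How a perturbation tolerance translates into a range for $t$ can be illustrated numerically. The sketch below assumes the bound in question is $\|g \cdot x - x\| \leq \varepsilon$ in the Euclidean norm, and uses a sufficient condition worked out here rather than quoted from the paper: since $\|\exp(t\mathbf{A}) - \mathrm{Id}\| \leq e^{|t|\|\mathbf{A}\|} - 1$ for any submultiplicative norm, it is enough that $|t| \leq \frac{1}{\|\mathbf{A}\|}\log\left(1 + \frac{\varepsilon}{\|x\|}\right)$ for $x \neq 0$ and $\mathbf{A} \neq 0$. The generator `A` (a rotation generator in $\mathfrak{so}(3)$), the point `x`, and the helper names are hypothetical.

```python
import numpy as np

def expm_taylor(M, terms=40):
    """Matrix exponential via a truncated power series (adequate for small ||M||)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result

def max_step(A, x, eps):
    """Largest |t| allowed by the sufficient condition |t| <= log(1 + eps/||x||) / ||A||."""
    op_norm = np.linalg.norm(A, 2)  # spectral norm, compatible with the Euclidean vector norm
    return np.log1p(eps / np.linalg.norm(x)) / op_norm

# Hypothetical example: generator of rotations about the z-axis, an element of so(3).
A = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])
eps = 0.05

t_max = max_step(A, x, eps)

# Check the perturbation size over the whole admissible interval [-t_max, t_max].
deviations = [np.linalg.norm(expm_taylor(t * A) @ x - x)
              for t in np.linspace(-t_max, t_max, 21)]
print("t_max:", t_max)
print("largest deviation:", max(deviations), "tolerance:", eps)
```

The condition is sufficient but not tight: for this rotation example the actual deviation stays well below `eps`, so larger values of `t` may also satisfy the tolerance.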

Figures (3)

  • Figure 1: Top: Input image of the digit 5 from MNIST. The adversarial point is $\tilde{y} = y + \delta x$, where $y$ is the image of the digit 5 and $x$ takes the form in equation (\ref{eqn:adv_point_mnist}), chosen to resemble the digit 3. Bottom: Clean integrated gradients explanation followed by adversarial explanations with increasing $\delta$. The error between the clean point $y$ and the adversarial point $\tilde{y}$ is $\|\delta x\|_{\infty}$.
  • Figure 2: Clean and adversarial explanation of a 'malignant' point in Wisconsin breast cancer dataset. Left: Integrated gradients feature importance. Right: SHAP feature importance.
  • Figure 3: Clean and adversarial explanation of a point in the network traffic dataset classified as 'Android'. Left: Integrated gradients feature importance. Right: SHAP feature importance. The error is $\|\delta x\|_{\infty} = 1$.
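
The effect shown in the figures, an unchanged prediction with a changed attribution map, can be reproduced on a toy model. The sketch below is illustrative only and rests on assumptions not taken from the paper: a small tanh network `f` with analytic gradient, a zero baseline, a midpoint Riemann-sum approximation of integrated gradients, and an additive direction `x_dir` chosen in the kernel of the first-layer weights, scaled so that $\|\delta x\|_{\infty} = \delta$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 5, 2  # input dimension and hidden width; h < n so ker(W1) is non-trivial
W1, b1 = rng.normal(size=(h, n)), rng.normal(size=h)
w2 = rng.normal(size=h)

def f(x):
    """Scalar toy model (an assumption): tanh hidden layer with a linear read-out."""
    return w2 @ np.tanh(W1 @ x + b1)

def grad_f(x):
    """Analytic gradient of f with respect to the input."""
    z = W1 @ x + b1
    return W1.T @ (w2 * (1.0 - np.tanh(z) ** 2))

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients on a straight path."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

# Direction in ker(W1): the last right-singular vector, rescaled to unit infinity norm.
_, _, vt = np.linalg.svd(W1)
x_dir = vt[-1] / np.max(np.abs(vt[-1]))

y = rng.normal(size=n)       # clean point
delta = 1.0
y_adv = y + delta * x_dir    # adversarial point, ||delta * x_dir||_inf == delta

baseline = np.zeros(n)
ig_clean = integrated_gradients(y, baseline)
ig_adv = integrated_gradients(y_adv, baseline)

print("output change:", abs(f(y_adv) - f(y)))
print("largest attribution change:", np.max(np.abs(ig_adv - ig_clean)))
```

In this construction the gradients along the two paths coincide, so the attributions differ by $\delta\, x \odot \bar{\nabla} f$, where $\bar{\nabla} f$ is the path-averaged gradient. The attributions still sum to the same total because the output is unchanged, which is why only the distribution of importance across features moves.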

Theorems & Definitions (47)

  • Definition 3.1
  • Definition 3.2
  • Remark 3.3
  • Proposition 3.4
  • proof
  • Remark 3.5
  • Definition 4.1
  • Proposition 4.2
  • proof
  • Theorem 4.3
  • ...and 37 more