Certified $\ell_2$ Attribution Robustness via Uniformly Smoothed Attributions

Fan Wang, Adams Wai-Kin Kong

TL;DR

This work introduces uniformly smoothed attributions, defined as $h(\boldsymbol{x}) = \mathbb{E}_{\boldsymbol{\eta} \sim \mathcal{U}(\mathcal{B}(\mathbf{0}; r))}[g(\boldsymbol{x}+\boldsymbol{\eta})]$, to certify attribution robustness under $\ell_2$ perturbations. It derives a computable lower bound on the cosine similarity between the smoothed attributions of clean and perturbed inputs, $T = \dfrac{\|h(\boldsymbol{x})\|_2}{\sqrt{\|h(\boldsymbol{x})\|_2^2 + (M V_U / V_{\mathcal{S}})^2}}$, where $M$ is the upper bound of the attribution function $g$, $V_{\mathcal{S}}$ is the volume of the smoothing ball, and $V_U$ is the volume of a region determined by how the two sampling balls $\mathcal{B}(\boldsymbol{x}; r)$ and $\mathcal{B}(\tilde{\boldsymbol{x}}; r)$ overlap (the shaded region in Figure 1). The bound holds for every perturbation with $\|\boldsymbol{\delta}\|_2 \le \epsilon$. Equivalent formulations of the certificate give the maximum perturbation size or the minimum smoothing radius, and the method is evaluated on MNIST, CIFAR-10, and ImageNet with bounded attributions such as Integrated Gradients. Empirical results show that uniformly smoothed attributions are more robust to attribution attacks and that the certified bounds closely track the observed cosine similarities, which supports the method's scalability and its usefulness for trustworthy explanations.
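The definition of $h(\boldsymbol{x})$ above can be approximated by Monte Carlo sampling. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the sample count, and the toy attribution `toy_attribution` are hypothetical and chosen only for illustration. Points are drawn uniformly from the $\ell_2$ ball by combining a Gaussian direction with a radius distributed as $r\,U^{1/d}$.

```python
import math
import random


def sample_uniform_ball(d, r, rng):
    """Draw one point uniformly from the l2 ball B(0; r) in R^d."""
    # A normalized Gaussian vector has a uniformly distributed direction.
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in z))
    # Scaling by r * U^(1/d) makes the density uniform in volume.
    radius = r * rng.random() ** (1.0 / d)
    return [radius * v / norm for v in z]


def smoothed_attribution(g, x, r, n_samples=1000, seed=0):
    """Monte Carlo estimate of h(x) = E[g(x + eta)], eta ~ U(B(0; r))."""
    rng = random.Random(seed)
    d = len(x)
    total = [0.0] * d
    for _ in range(n_samples):
        eta = sample_uniform_ball(d, r, rng)
        attr = g([xi + ei for xi, ei in zip(x, eta)])
        total = [t + a for t, a in zip(total, attr)]
    return [t / n_samples for t in total]


def toy_attribution(x):
    """Hypothetical stand-in for g: gradient * input of f(x) = sum(x_i^2)."""
    return [2.0 * xi * xi for xi in x]


h = smoothed_attribution(toy_attribution, [1.0, -0.5, 0.25], r=0.3)
```

In practice `g` would be an attribution method such as Integrated Gradients evaluated on a trained network; the estimate becomes more accurate as `n_samples` grows.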

Abstract

Model attribution is a popular tool to explain the rationales behind model predictions. However, recent work suggests that attributions are vulnerable to minute perturbations, which can be added to input samples to fool the attributions while maintaining the prediction outputs. Although empirical studies have shown positive performance via adversarial training, an effective certified defense method is eminently needed to understand the robustness of attributions. In this work, we propose a uniform smoothing technique that augments the vanilla attributions with noise uniformly sampled from a certain space. It is proved that, for all perturbations within the attack region, the cosine similarity between the uniformly smoothed attributions of the perturbed sample and the unperturbed sample is guaranteed to be lower bounded. We also derive alternative formulations of the certification that are equivalent to the original one and provide the maximum size of perturbation or the minimum smoothing radius such that the attribution cannot be perturbed. We evaluate the proposed method on three datasets and show that it can effectively protect the attributions from attacks, regardless of the network architecture, the training scheme, and the size of the dataset.
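The robustness measure used throughout is the cosine similarity between two attribution vectors. A minimal sketch of that measure follows, together with a check of whether an observed pair respects a given lower bound; the helper names `cosine_similarity` and `bound_holds` are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two attribution vectors of equal length."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def bound_holds(h_clean, h_perturbed, lower_bound):
    """True if the observed similarity is at least the certified lower bound."""
    return cosine_similarity(h_clean, h_perturbed) >= lower_bound
```

A value of 1 means the perturbed attribution points in the same direction as the clean one, so a certified lower bound close to 1 indicates that no perturbation in the attack region can rotate the attribution far.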

Paper Structure (28 sections, 5 theorems, 37 equations, 3 figures, 9 tables)

This paper contains 28 sections, 5 theorems, 37 equations, 3 figures, 9 tables.

Key Result

Theorem 1

Let $g:\mathbb{R}^d\rightarrow\mathbb{R}^d$ be an upper-bounded attribution function, and let $\bm{\eta}\stackrel{U}{\sim}\mathcal{B}(\bm{0}; r)$. Let $h$ be the smoothed version of $g$ as defined in (eqn:smoothed_attribution). Then, for all $\tilde{\bm{x}}\in\left\{\bm{x} + \bm{\delta}\,\vert\,\ldots\right\}$ […] Here, $M$ is the upper bound of $g$, and $V_{\mathcal{S}}$ is the volume of the $\ell_2$-ball […]
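As a worked illustration of the quantities named in the theorem, the sketch below evaluates the expression for $T$ quoted in the TL;DR and the standard closed-form volume of a $d$-dimensional $\ell_2$ ball, $\pi^{d/2} r^d / \Gamma(d/2 + 1)$. It is an assumption-laden sketch: the function names are hypothetical, and the ratio $V_U / V_{\mathcal{S}}$ is taken as an input because the truncated statement above does not give a formula for $V_U$. The volume is computed in log space since $r^d$ underflows for image-sized $d$.

```python
import math


def log_ball_volume(d, r):
    """Natural log of the volume of an l2 ball of radius r in R^d."""
    # V = pi^(d/2) * r^d / Gamma(d/2 + 1), evaluated in log space.
    return 0.5 * d * math.log(math.pi) + d * math.log(r) - math.lgamma(0.5 * d + 1.0)


def certified_cosine_bound(h_norm, M, volume_ratio):
    """T = ||h(x)|| / sqrt(||h(x)||^2 + (M * V_U / V_S)^2).

    volume_ratio is V_U / V_S and is assumed to be supplied by the caller.
    """
    return h_norm / math.hypot(h_norm, M * volume_ratio)


# Example with made-up numbers: ||h(x)||_2 = 3, M = 1, V_U / V_S = 4.
T = certified_cosine_bound(3.0, 1.0, 4.0)
```

The expression shows the trade-off directly: when the ratio $V_U / V_{\mathcal{S}}$ shrinks toward zero the bound approaches 1, and a larger $M$ relative to $\|h(\bm{x})\|_2$ loosens it.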

Figures (3)

  • Figure 1: (left) Examples of attributions (zoom in for better visibility). We choose to show the integrated gradients (IG) and its corresponding smoothing results. For Gaussian smoothing, the noise level is set to $\sigma=0.2$, and for uniformly smoothed IG, an $\ell_2$ ball with radius $\sqrt{3}\sigma$ is used. (right) A 2D illustration of the volumes of $\mathcal{B}(\bm{x};r)$ and $\mathcal{B}(\tilde{\bm{x}};r)$, as well as the relationship between $h(\bm{x})$ and $h(\tilde{\bm{x}})$. Here $h(\bm{x})$ is the original attribution, and $a\bm{v}$ represents the magnitude and direction of the translation of $h(\bm{x})$ after the sample is perturbed. $V_U$ in Theorem 1 is the volume of the shaded region in the figure, and $V_\mathcal{S}$ is the volume of each individual ball. When $h(\bm{x})$ is fixed, the lower bound of the cosine similarity between $h(\bm{x})$ and $h(\bm{x})+a\bm{v}$ can be derived as a function of the volumes.
  • Figure 2: The gap between theoretical bounds and empirical cosine similarity between original and perturbed attribution evaluated on CIFAR-10 using IGR.
  • Figure 3: Additional visualization of the attribution maps of the (a) original image, (b) IG, (c) Gaussian smoothed IG, and (d) uniformly smoothed IG.

Theorems & Definitions (8)

  • Theorem 1
  • Corollary 1
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Corollary 1
  • proof