Saliency strikes back: How filtering out high frequencies improves white-box explanations
Sabine Muzellec, Thomas Fel, Victor Boutin, Léo Andéol, Rufin VanRullen, Thomas Serre
TL;DR
This work identifies a prevalent flaw in gradient-based white-box explanations: high-frequency artifacts in the gradient degrade faithfulness. It introduces FORGrad, a Fourier-based repair that applies an architecture- and method-specific low-pass filter to the gradient $\nabla_{\bm{x}} \bm{f}(\bm{x})$. The cutoff $\sigma^{\star}$ is chosen to maximize the faithfulness metric $F$ over a validation set, $\sigma^{\star} = \arg\max_{\sigma} \mathbb{E}_{\bm{x} \sim \mathcal{V}} F(\bm{\varphi}_{\sigma}(\bm{x}))$, where $\bm{\varphi}_{\sigma}$ denotes the attribution computed with cutoff $\sigma$ and $\mathcal{V}$ contains 1,280 ImageNet validation images. The authors demonstrate that white-box attributions exhibit more high-frequency content than black-box attributions, largely because of max-pooling and downsampling. By filtering out these frequencies, FORGrad substantially improves the faithfulness, stability, and ranking of white-box methods, closing the gap with black-box approaches while preserving computational efficiency. The findings suggest that architectural factors such as pooling contribute to gradient artifacts, and they motivate future work on pooling strategies and on transformer architectures. Overall, FORGrad enables simpler, more efficient white-box explanations to compete with heavier black-box methods on XAI benchmarks, with potential implications for training-time filtering and robustness.
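The filtering and cutoff selection described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the hard circular mask, the function names `low_pass_filter` and `select_cutoff`, and the `faithfulness` callable are all hypothetical, and the filter is applied directly to a 2D attribution map (which coincides with filtering the gradient only for plain gradient saliency).

```python
import numpy as np


def low_pass_filter(attribution, sigma):
    """Zero out spatial frequencies farther than `sigma` from the DC component.

    Assumption: an ideal (hard) circular mask; the summary above does not
    specify the exact filter shape.
    """
    h, w = attribution.shape
    # Move the zero-frequency (DC) term to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(attribution))
    yy, xx = np.ogrid[:h, :w]
    radius = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    mask = radius <= sigma
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    # The mask is symmetric, so the imaginary part is only numerical noise.
    return filtered.real


def select_cutoff(attributions, faithfulness, sigmas):
    """Return the candidate cutoff with the highest mean faithfulness score.

    `faithfulness` is a hypothetical callable mapping a filtered attribution
    map to a scalar where higher is better (a stand-in for F above).
    """
    scores = [
        np.mean([faithfulness(low_pass_filter(a, s)) for a in attributions])
        for s in sigmas
    ]
    return sigmas[int(np.argmax(scores))]
```

In practice `attributions` would hold the maps computed on the validation images and `sigmas` a grid of candidate cutoffs; the grid search stands in for the $\arg\max$ over $\sigma$.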
Abstract
Attribution methods are a class of explainable AI (XAI) methods that aim to assess how individual inputs contribute to a model's decision-making process. We have identified a significant limitation in one type of attribution method, known as "white-box" methods. Although these methods are highly efficient, we show that they rely on a gradient signal that is often contaminated by high-frequency artifacts. To overcome this limitation, we introduce a new approach called "FORGrad". This simple method filters out these high-frequency artifacts using optimal cut-off frequencies tailored to the characteristics of each model architecture. Our findings show that FORGrad consistently enhances the performance of existing white-box methods, enabling them to compete effectively with more accurate yet computationally demanding "black-box" methods. We anticipate that our research will foster broader adoption of simpler and more efficient white-box methods for explainability, offering a better balance between faithfulness and computational efficiency.
