Full Bayesian Significance Testing for Neural Networks
Zehua Liu, Zimeng Li, Jingyuan Wang, Yue He
TL;DR
This work addresses the challenge of significance testing in nonlinear, high-dimensional settings by introducing nFBST, a Full Bayesian Significance Testing framework for neural networks. It leverages Bayesian neural networks to fit the data-generating process and derives the posterior of a testing statistic $\eta(\theta)$, using KDE and Monte Carlo to compute the Bayesian evidence $Ev(H_0)$ for $H_0: \eta(\theta)=0$, without requiring a fixed form for $f_0$. The framework supports global, local, and instance-wise significance through multiple testing statistics (Grad-, LRP-, DeepLIFT-, LIME-nFBST) and a quantile-based global test (Q-GS). Across toy, simulation, and real-data experiments, nFBST consistently identifies significant features and provides interpretable instance-wise insights, outperforming classical methods. This approach offers a general, extensible paradigm for rigorous significance testing in nonlinear neural models with practical implications for scientific inference and model interpretability.
Abstract
Significance testing aims to determine whether a proposition about the population distribution is the truth or not given observations. However, traditional significance testing often needs to derive the distribution of the testing statistic, failing to deal with complex nonlinear relationships. In this paper, we propose to conduct Full Bayesian Significance Testing for neural networks, called \textit{n}FBST, to overcome the limitation in relationship characterization of traditional approaches. A Bayesian neural network is utilized to fit the nonlinear and multi-dimensional relationships with small errors and avoid hard theoretical derivation by computing the evidence value. Besides, \textit{n}FBST can test not only global significance but also local and instance-wise significance, which previous testing methods don't focus on. Moreover, \textit{n}FBST is a general framework that can be extended based on the measures selected, such as Grad-\textit{n}FBST, LRP-\textit{n}FBST, DeepLIFT-\textit{n}FBST, LIME-\textit{n}FBST. A range of experiments on both simulated and real data are conducted to show the advantages of our method.
