
Explaining Neural Networks without Access to Training Data

Sascha Marton, Stefan Lüdtke, Christian Bartelt, Andrej Tschalzev, Heiner Stuckenschmidt

TL;DR

This paper tackles the problem of explaining neural networks without access to training data by extending Interpretation Networks (I-Nets) to learn surrogate decision tree (DT) models directly from network parameters. A robust data-generation scheme based on multiple distributions is used to train I-Nets that map network internals to standard and soft DT representations, yielding high-fidelity explanations even when training data are unavailable. Empirical results on real-world datasets and a credit card default case study show that I-Nets outperform traditional sample-based distillation, often by large margins and especially for low-complexity surrogates. The work demonstrates that reliable, data-free global explanations are practically viable, with implications for privacy-sensitive and safety-constrained settings.

Abstract

We consider generating explanations for neural networks in cases where the network's training data is not accessible, for instance due to privacy or safety issues. Recently, $\mathcal{I}$-Nets have been proposed as a sample-free approach to post-hoc, global model interpretability that does not require access to training data. They formulate interpretation as a machine learning task that maps network representations (parameters) to a representation of an interpretable function. In this paper, we extend the $\mathcal{I}$-Net framework to the cases of standard and soft decision trees as surrogate models. We propose a suitable decision tree representation and design of the corresponding $\mathcal{I}$-Net output layers. Furthermore, we make $\mathcal{I}$-Nets applicable to real-world tasks by considering more realistic distributions when generating the $\mathcal{I}$-Net's training data. We empirically evaluate our approach against traditional global, post-hoc interpretability approaches and show that it achieves superior results when the training data is not accessible.
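The abstract does not spell out the decision tree representation used for the $\mathcal{I}$-Net output layer. As a rough illustration only, the sketch below assumes a hypothetical flat layout for an output vector that describes a complete binary tree of fixed depth stored in heap order. The layout has one row of split-feature logits per internal node, one threshold per internal node, and one class-1 probability per leaf. It is not claimed to be the authors' encoding. The soft variant uses a common soft-decision-tree formulation: internal node $i$ sends an input $x$ to its right child with probability $\sigma(w_i^\top x + b_i)$, and the prediction is the leaf values weighted by path probability. All function names are hypothetical.

```python
import numpy as np


def decode_standard_dt(vec, n_features, depth):
    """Split a flat (hypothetical) I-Net output vector into tree parameters.

    Assumed layout: [feature logits per internal node | one threshold per
    internal node | one class-1 probability per leaf], nodes in heap order.
    """
    n_internal = 2 ** depth - 1
    n_leaves = 2 ** depth
    expected = n_internal * n_features + n_internal + n_leaves
    if vec.size != expected:
        raise ValueError(f"expected {expected} values, got {vec.size}")
    cut = n_internal * n_features
    logits = vec[:cut].reshape(n_internal, n_features)
    thresholds = vec[cut:cut + n_internal]
    leaf_probs = vec[cut + n_internal:]
    features = logits.argmax(axis=1)  # hard choice of one split feature per node
    return features, thresholds, leaf_probs


def predict_standard_dt(X, features, thresholds, leaf_probs, depth):
    """Route every sample to one leaf; children of node i are 2i+1 (left), 2i+2 (right)."""
    node = np.zeros(len(X), dtype=int)
    for _ in range(depth):
        go_right = X[np.arange(len(X)), features[node]] > thresholds[node]
        node = 2 * node + 1 + go_right
    n_internal = 2 ** depth - 1
    return leaf_probs[node - n_internal]  # leaves are indexed left to right


def predict_soft_dt(X, W, b, leaf_probs, depth):
    """Soft DT: node i sends x right with probability sigmoid(W[i] @ x + b[i])."""
    n = len(X)
    path = np.ones((n, 1))  # probability of reaching each node on the current level
    first = 0  # heap index of the first node on the current level
    for level in range(depth):
        idx = np.arange(first, first + 2 ** level)
        p_right = 1.0 / (1.0 + np.exp(-(X @ W[idx].T + b[idx])))  # shape (n, 2**level)
        # Interleave children so position j on this level maps to 2j (left), 2j+1 (right).
        path = np.stack([path * (1.0 - p_right), path * p_right], axis=2).reshape(n, -1)
        first += 2 ** level
    return path @ leaf_probs  # expected class-1 probability over all leaves
```

Taking the argmax over the feature logits yields one hard, readable split per node. The soft tree keeps every routing decision differentiable, which is one reason soft trees are convenient targets for gradient-based training.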

Paper Structure (33 sections, 5 equations, 14 figures, 11 tables, 1 algorithm)

Figures (14)

  • Figure 1: Explaining Neural Networks for Credit Card Default Prediction. The DT on the left is learned by sample-based distillation without access to training data, and the DT on the right is predicted by the $\mathcal{I}$-Net. The $\mathcal{I}$-Net makes reasonable splits and achieves a significantly higher fidelity on the real data.
  • Figure 2: Sample-Based and $\boldsymbol{\mathcal{I}}$-Net Approach. Sample-based approaches query the target network on a set of data points. Querying the network with the training data (I) usually yields a meaningful explanation. If the training data is not available, the network has to be queried with randomly sampled data, e.g., from a uniform distribution (II), which often fails to yield a meaningful explanation because the relevant regions of the input space are not queried adequately. The $\mathcal{I}$-Net uses the network parameters as input to generate a reasonable explanation (III) and does not rely on querying the neural network.
  • Figure 3: Good and Bad Explanations. This figure shows an example decision boundary of a bad (II) and a good (III) explanation for the model we want to interpret (I). Without considering the data (a), the explanation shown in II appears very reasonable, since the regions it creates cover most of the original model's decision boundary. However, when the data is taken into account (b), the small area in the center of the picture turns out to be very important, because many samples are located there. This area is neglected by the explanation shown in II and only captured by the explanation shown in III.
  • Figure 4: Overview of the $\boldsymbol{\mathcal{I}}$-Net Approach. The parameters of the neural network are used as input and translated into a surrogate model (marton2022explanations).
  • Figure 5: Data Generation Visualization. This graphic visualizes the generation of a balanced, random dataset used for training a network $\lambda$ where $D \in \{\mathcal{U}, \mathcal{N}, \Gamma, \text{B}, \text{Poi}\}$. For each feature, a random distribution with two random parametrizations is chosen and a random number of data points is sampled from each distribution.
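Figures 2 and 3 argue that querying the network with uniformly sampled points can miss the region where the real data lives. The toy sketch below reproduces that effect under stated assumptions. `target_network` is a hand-coded, hypothetical stand-in for the network being explained, not a trained model. The surrogate is a depth-1 decision tree (a stump) found by exhaustive search rather than a full DT learner. Fidelity is taken to be the share of inputs on which the surrogate and the network agree. The sketch covers sample-based distillation in settings (I) and (II) of Figure 2, not the $\mathcal{I}$-Net itself.

```python
import numpy as np


def target_network(X):
    """Hypothetical stand-in for the explained network (not a trained model):
    outside a small central square the class is x0 > 0.5, inside it x1 > 0.5."""
    center = (np.abs(X[:, 0] - 0.5) < 0.1) & (np.abs(X[:, 1] - 0.5) < 0.1)
    return np.where(center, X[:, 1] > 0.5, X[:, 0] > 0.5).astype(int)


def fit_stump(X, y):
    """Depth-1 decision tree: exhaustive search over features, thresholds and leaf labels."""
    best_acc, best = -1.0, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            for lo, hi in ((0, 1), (1, 0)):
                acc = np.mean(np.where(left, lo, hi) == y)
                if acc > best_acc:
                    best_acc, best = acc, (j, t, lo, hi)
    j, t, lo, hi = best
    return lambda Z: np.where(Z[:, j] <= t, lo, hi)


def fidelity(model_a, model_b, X):
    """Share of inputs on which the two models predict the same class."""
    return float(np.mean(model_a(X) == model_b(X)))


rng = np.random.default_rng(0)
# Assumption mirroring Figure 3: the "real" data lies only inside the central square.
real_train = rng.uniform(0.41, 0.59, size=(500, 2))
real_test = rng.uniform(0.41, 0.59, size=(500, 2))
uniform_queries = rng.uniform(0.0, 1.0, size=(500, 2))

stump_real = fit_stump(real_train, target_network(real_train))              # setting (I)
stump_uniform = fit_stump(uniform_queries, target_network(uniform_queries))  # setting (II)
print(f"fidelity, distilled on training data:   {fidelity(target_network, stump_real, real_test):.2f}")
print(f"fidelity, distilled on uniform queries: {fidelity(target_network, stump_uniform, real_test):.2f}")
```

Under these assumptions, the stump distilled from uniform queries splits on $x_0$. That split matches the network almost everywhere in the unit square, but inside the central region, where the real data lies, it agrees with the network only about half the time. The stump distilled from the training data splits on $x_1$ and agrees almost perfectly.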
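The generation procedure in Figure 5 can be sketched as follows. The caption fixes only the structure: one distribution family per feature from $\{\mathcal{U}, \mathcal{N}, \Gamma, \text{B}, \text{Poi}\}$, two random parametrizations, and a random number of points drawn from each. The parameter ranges, the min-max scaling to $[0, 1]$, and the labeling rule that makes the dataset balanced are assumptions of this sketch, not details taken from the paper. Function names are hypothetical.

```python
import numpy as np

DISTRIBUTIONS = ("uniform", "normal", "gamma", "beta", "poisson")


def sample_component(rng, name, size):
    """Draw `size` values from one randomly parametrized distribution.
    The parameter ranges below are illustrative assumptions."""
    if name == "uniform":
        low = rng.uniform(0.0, 1.0)
        return rng.uniform(low, low + rng.uniform(0.1, 1.0), size)
    if name == "normal":
        return rng.normal(rng.uniform(0.0, 1.0), rng.uniform(0.1, 1.0), size)
    if name == "gamma":
        return rng.gamma(rng.uniform(0.5, 5.0), rng.uniform(0.1, 1.0), size)
    if name == "beta":
        return rng.beta(rng.uniform(0.5, 5.0), rng.uniform(0.5, 5.0), size)
    if name == "poisson":
        return rng.poisson(rng.uniform(0.5, 5.0), size).astype(float)
    raise ValueError(f"unknown distribution: {name}")


def generate_dataset(n_samples, n_features, seed=0):
    """Build one random, balanced dataset for training a network lambda (requires n_samples >= 2)."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_samples, n_features))
    chosen = []
    for j in range(n_features):
        name = DISTRIBUTIONS[rng.integers(len(DISTRIBUTIONS))]
        # Random split of the points between the two parametrizations of this feature.
        n_first = int(rng.integers(1, n_samples))
        column = np.concatenate([sample_component(rng, name, n_first),
                                 sample_component(rng, name, n_samples - n_first)])
        rng.shuffle(column)
        span = column.max() - column.min()
        # Assumed min-max scaling to [0, 1]; a constant column is mapped to 0.
        X[:, j] = (column - column.min()) / span if span > 0 else 0.0
        chosen.append(name)
    # Hypothetical stand-in labeling: rank a random projection and label the top
    # half as class 1, which makes the two classes exactly balanced.
    score = X @ rng.normal(size=n_features)
    y = np.zeros(n_samples, dtype=int)
    y[np.argsort(score)[n_samples // 2:]] = 1
    return X, y, chosen
```

In this reading, many such datasets would each be used to train a network $\lambda$, and the parameters of those networks would then serve as the $\mathcal{I}$-Net's training inputs.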