Closed-Form Interpretation of Neural Network Classifiers with Symbolic Gradients

Sebastian Johann Wetzel

TL;DR

This paper introduces a unified framework for finding a closed-form interpretation of any single neuron in an artificial neural network, and demonstrates how to interpret neural network classifiers to reveal closed-form expressions of the concepts encoded in their decision boundaries.

Abstract

I introduce a unified framework for finding a closed-form interpretation of any single neuron in an artificial neural network. Using this framework I demonstrate how to interpret neural network classifiers to reveal closed-form expressions of the concepts encoded in their decision boundaries. In contrast to neural network-based regression, for classification, it is in general impossible to express the neural network in the form of a symbolic equation even if the neural network itself bases its classification on a quantity that can be written as a closed-form equation. The interpretation framework is based on embedding trained neural networks into an equivalence class of functions that encode the same concept. I interpret these neural networks by finding an intersection between the equivalence class and human-readable equations defined by a symbolic search space. The approach is not limited to classifiers or full neural networks and can be applied to arbitrary neurons in hidden layers or latent spaces.
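
Why a classifier can encode a closed-form quantity without being equal to it can be sketched with the chain rule. The notation here is an assumption for illustration: $g$ denotes the closed-form quantity, $f$ the network output, and $\phi$ a differentiable scalar function with $\phi'\neq 0$ such that $f=\phi(g)$, evaluated at points where $\nabla g\neq 0$.

```latex
\nabla f(\mathbf{x}) = \phi'\bigl(g(\mathbf{x})\bigr)\,\nabla g(\mathbf{x})
\quad\Longrightarrow\quad
\frac{\nabla f(\mathbf{x})}{\lVert \nabla f(\mathbf{x}) \rVert}
= \operatorname{sign}\Bigl(\phi'\bigl(g(\mathbf{x})\bigr)\Bigr)\,
\frac{\nabla g(\mathbf{x})}{\lVert \nabla g(\mathbf{x}) \rVert}
```

Under these assumptions the unknown $\phi$ drops out up to a sign, which suggests why gradient directions are a natural target for a symbolic search even when $f$ itself has no simple closed form.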

Paper Structure

This paper contains 24 sections, 14 equations, 5 figures, 1 table, 1 algorithm.

Introduction
Framework Overview
Interpreting an Artificial Neural Network
Equivalence Class of Functions Containing the Same Information
Equivalence of Equivalence Classes
Proof
Assumptions
Extracting Symbolic Concepts Encoded in Neural Networks
Methods
Artificial Neural Network
Symbolic Search
Symbolic Search Space
Symbolic Search Algorithm
Interpretation Algorithm
Training a Neural Network For Binary Classification
...and 9 more sections

Figures (5)

Figure 1: a: An artificial neural network is a connected graph of nodes representing neurons, with weighted connections between them. In the context of binary classification considered here, the neural network predicts an approximate target $F(\mathbf{x})=\hat{y} \approx y$. Removing the final sigmoid activation function extracts the latent model $f$ from the full neural network $F$, which makes interpretation easier. b: The interpretation framework is based on finding the intersection between human-readable functions and the equivalence class \ref{eq:H_g} of functions that contain the same information as an output neuron of the neural network. The space of human-readable functions can be defined through a symbolic search space whose elementary functions and complexity match the user's knowledge. This space can be explored computationally by genetic algorithms, whose candidate expressions are represented mathematically by c: a symbolic tree. A tree consists of connected nodes containing variables, numeric parameters, and unary and binary operators. Symbolic search is performed by a genetic algorithm that modifies, evolves, and adds nodes to optimize an objective function on some underlying training data.
Figure 2: a: Two-class data of Experiment 1, separated by the decision boundary $g(\mathbf{x})=x_1^2+2x_2^2=1$. A neural network $F$ is trained to classify the data. Afterward, a symbolic model $T$ is trained to reproduce the normalized gradients of $F$, which coincide with the normalized gradients of the function $g$ that defines the decision boundary. b: Empirical correlation between the true function $g$ and the neural network $F$. Removing the sigmoid activation function from $F$ defines the latent model $f$, whose relation to $g$ is almost, but not exactly, linear. This relation defines the function $\phi$ in $f=\phi(g)$, with which I ascertain the equivalence relation $f\sim g$, assuring that $F$ and $f$ contain the same information as $g$ and thus $F,f\in\tilde{H}_g$ according to eq. \ref{eq:H_gtilde}.
Figure 3: The results of fitting a symbolic model $T$ to the normalized gradients of the neural network are presented along the Pareto front. The Pareto front collects several possible results with decreasing Mean Square Error (MSE) and increasing complexity. The closest match to the true underlying function is often found at the point of steepest change of the Pareto front.
Figure 4: a: Results of fitting a symbolic classification model to the data of Experiment 7. b: Interpretation of a neural network classifying the same data set. c: Empirical correlation between symbolic classification, the proposed interpretation method, and the latent model $f$. Symbolic classification learns a different high-level feature than the neural network. The interpretation framework presented in this paper correctly interprets the neural network.
Figure 5: The results of fitting a symbolic classification model $T$ to six experiments. The Pareto front collects several possible results with decreasing Mean Square Error (MSE) and increasing complexity. The closest match to the true underlying function is often found at the point of steepest change of the Pareto front.
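
The symbolic tree described in the caption of Figure 1 (nodes holding variables, numeric parameters, unary and binary operators) can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation: the class names `Var`, `Const`, `Unary`, `Binary`, the helpers `evaluate` and `complexity`, and the choice of node count as the complexity measure are all assumptions made for this example.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Sequence, Union


@dataclass(frozen=True)
class Var:
    index: int  # position of the input variable in x (hypothetical convention)


@dataclass(frozen=True)
class Const:
    value: float  # numeric parameter


@dataclass(frozen=True)
class Unary:
    op: Callable[[float], float]
    child: "Node"


@dataclass(frozen=True)
class Binary:
    op: Callable[[float, float], float]
    left: "Node"
    right: "Node"


Node = Union[Var, Const, Unary, Binary]


def evaluate(node: Node, x: Sequence[float]) -> float:
    """Recursively evaluate the expression tree at the input point x."""
    if isinstance(node, Var):
        return x[node.index]
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Unary):
        return node.op(evaluate(node.child, x))
    if isinstance(node, Binary):
        return node.op(evaluate(node.left, x), evaluate(node.right, x))
    raise TypeError(f"unknown node type: {type(node).__name__}")


def complexity(node: Node) -> int:
    """Count the nodes in the tree (an assumed complexity measure)."""
    if isinstance(node, (Var, Const)):
        return 1
    if isinstance(node, Unary):
        return 1 + complexity(node.child)
    return 1 + complexity(node.left) + complexity(node.right)


# Tree for x_1^2 + 2 * x_2^2, the function from the caption of Figure 2.
X1, X2 = Var(0), Var(1)
TREE = Binary(
    operator.add,
    Binary(operator.mul, X1, X1),
    Binary(operator.mul, Const(2.0), Binary(operator.mul, X2, X2)),
)
```

A genetic algorithm of the kind mentioned in the caption would mutate such trees by replacing, inserting, or deleting nodes; that search loop is left out of this sketch.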
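
The statement in the caption of Figure 2, that the normalized gradients of the network coincide with those of $g$ whenever $f=\phi(g)$, can be checked numerically on a toy stand-in. The sketch below does not train a neural network: `phi` is a hypothetical strictly increasing distortion chosen for illustration, `f` plays the role of the latent model, and gradients are approximated by central differences.

```python
import math
from typing import Callable, List, Sequence


def g(x: Sequence[float]) -> float:
    """Function defining the decision boundary of Experiment 1."""
    return x[0] ** 2 + 2.0 * x[1] ** 2


def phi(z: float) -> float:
    """Hypothetical nonlinear, strictly increasing distortion (an assumption)."""
    return math.tanh(z) + 0.1 * z


def f(x: Sequence[float]) -> float:
    """Stand-in for the latent model: carries the same information as g."""
    return phi(g(x))


def gradient(func: Callable[[Sequence[float]], float],
             x: Sequence[float], h: float = 1e-6) -> List[float]:
    """Central-difference approximation of the gradient of func at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((func(x_plus) - func(x_minus)) / (2.0 * h))
    return grad


def normalized(v: Sequence[float]) -> List[float]:
    """Scale v to unit Euclidean length (v must be nonzero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def max_deviation(points: Sequence[Sequence[float]]) -> float:
    """Largest componentwise gap between normalized gradients of f and g."""
    worst = 0.0
    for p in points:
        ng = normalized(gradient(g, p))
        nf = normalized(gradient(f, p))
        worst = max(worst, max(abs(a - b) for a, b in zip(ng, nf)))
    return worst


# Sample points away from the origin, where the gradient of g is nonzero.
POINTS = [(0.3, 0.8), (-1.2, 0.4), (0.9, -0.5)]
```

Although `f` and `g` take different values at every sample point, their gradient directions agree up to numerical error, which is the property the symbolic model $T$ is fitted to.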
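
The captions of Figures 3 and 5 describe reading the result off a Pareto front at its point of steepest change. One way to make that concrete is sketched below. The score used here (drop in log MSE per unit of added complexity) is an assumed heuristic, not necessarily the criterion used in the paper, and the candidate list `TOY_CANDIDATES` consists of invented numbers for illustration only, not results from any experiment.

```python
import math
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[int, float]  # (complexity, mean squared error)


def pareto_front(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Keep each candidate whose MSE is lower than that of every simpler one."""
    front: List[Candidate] = []
    best_mse = math.inf
    for complexity, mse in sorted(candidates):
        if mse < best_mse:
            front.append((complexity, mse))
            best_mse = mse
    return front


def steepest_change(front: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the front entry reached by the largest log-MSE drop per unit
    of added complexity (assumes all MSE values are strictly positive)."""
    best_score = -math.inf
    best_entry: Optional[Candidate] = None
    for (c0, m0), (c1, m1) in zip(front, front[1:]):
        score = (math.log(m0) - math.log(m1)) / (c1 - c0)
        if score > best_score:
            best_score, best_entry = score, (c1, m1)
    return best_entry


# Invented toy values, NOT taken from the paper.
TOY_CANDIDATES: List[Candidate] = [
    (1, 0.5), (3, 0.4), (5, 0.3), (6, 0.35),
    (7, 0.001), (9, 0.0009), (11, 0.0008),
]
```

In this toy list the entry at complexity 6 is dominated by the simpler entry at complexity 5 and is dropped from the front, and the steepest change is the jump to complexity 7.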
