Table of Contents
Fetching ...

Towards Interpretable Deep Neural Networks for Tabular Data

Khawla Elhadri, Jörg Schlötterer, Christin Seifert

Abstract

Tabular data is the foundation of many applications in fields such as finance and healthcare. Although DNNs tailored for tabular data achieve competitive predictive performance, they are blackboxes with little interpretability. We introduce XNNTab, a neural architecture that uses a sparse autoencoder (SAE) to learn a dictionary of monosemantic features within the latent space used for prediction. Using an automated method, we assign human-interpretable semantics to these features. This allows us to represent predictions as linear combinations of semantically meaningful components. Empirical evaluations demonstrate that XNNTab attains performance on par with or exceeding that of state-of-the-art, black-box neural models and classical machine learning approaches while being fully interpretable.

Towards Interpretable Deep Neural Networks for Tabular Data

Abstract

Tabular data is the foundation of many applications in fields such as finance and healthcare. Although DNNs tailored for tabular data achieve competitive predictive performance, they are blackboxes with little interpretability. We introduce XNNTab, a neural architecture that uses a sparse autoencoder (SAE) to learn a dictionary of monosemantic features within the latent space used for prediction. Using an automated method, we assign human-interpretable semantics to these features. This allows us to represent predictions as linear combinations of semantically meaningful components. Empirical evaluations demonstrate that XNNTab attains performance on par with or exceeding that of state-of-the-art, black-box neural models and classical machine learning approaches while being fully interpretable.

Paper Structure

This paper contains 14 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Architecture overview. A. An MLP learns nonlinear features, which are decomposed into monosemantic dictionary features using an SAE. B. Dictionary features are assigned a human-interpretable meaning by learning rules for the subset of training instances that highly activate a specific feature. C. Predictions are linear combinations of monosemantic features by combining the linear model components (the decoder, $M^T$, and the linear layer of the MLP, $W$).
  • Figure 2: Left: Decision weight matrix $W'$ for ADULT on interpretable features. Right: Statistics on complexity of features on ADULT and CHURN.
  • Figure 3: Fraction of features and instances covered by rules to explain ADULT at different feature activation thresholds $t$. Orange bars show the average fraction of rules extracted per active feature with the number of active features on top, blue bars show the average recall of the rules in terms of instances.