Interpretable Visualizations of Data Spaces for Classification Problems

Christian Jorgensen, Arthur Y. Lin, Rhushil Vasavada, Rose K. Cersonsky

TL;DR

This work introduces principal covariates classification (PCovC), a hybrid supervised-unsupervised mapping technique designed to visualize and interpret classification decision boundaries in data spaces. By integrating a classifier-derived evidence matrix into a framework modeled on principal covariates regression (PCovR), PCovC yields low-dimensional latent spaces that reflect both data structure and classification performance, enabling qualitative and quantitative analysis of boundaries across diverse domains. Through case studies in neurotoxicity, organosulfur spectroscopy, inorganic materials, and MNIST, the authors demonstrate improved boundary delineation, robust interpretability of feature influence, and practical benefits for downstream tasks such as preprocessing MNIST before nonlinear embeddings. The approach offers a general, scalable way to "unbox" machine-learning decisions in chemistry and related fields, with an open-source implementation in scikit-matter.

Abstract

How do classification models "see" our data? Based on their success in delineating behaviors, there must be some lens through which it is easy to see the boundary between classes; however, our current set of visualization techniques makes this prospect difficult. In this work, we propose a hybrid supervised-unsupervised technique distinctly suited to visualizing the decision boundaries determined by classification problems. This method provides a human-interpretable map that can be analyzed qualitatively and quantitatively, which we demonstrate through visualizing and interpreting a decision boundary for chemical neurotoxicity. While we discuss this method in the context of chemistry-driven problems, its application can be generalized across subfields for "unboxing" the operations of machine-learning classification models.
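The hybrid construction described above can be illustrated with a minimal NumPy sketch. It is not the scikit-matter implementation: it assumes a ridge-regularized least-squares fit on one-hot labels as a stand-in linear classifier, takes the evidence matrix to be $\mathbf{Z} = \mathbf{X}\mathbf{W}$, and mixes the two Gram matrices with Frobenius-norm scaling. The function name `pcovc_sketch`, the `reg` parameter, and the normalization are illustrative assumptions.

```python
import numpy as np


def pcovc_sketch(X, y, alpha=0.5, n_components=2, reg=1e-6):
    """Illustrative PCovC-style projection (hypothetical helper, not the library API).

    alpha=1 recovers a PCA-like map of X; alpha=0 uses only the evidence Z.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # center the features

    # One-hot encode the labels and center them.
    classes = np.unique(y)
    Y = (np.asarray(y)[:, None] == classes[None, :]).astype(float)
    Y = Y - Y.mean(axis=0)

    # Assumed stand-in classifier: ridge-regularized least squares on one-hot labels.
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    Z = X @ W  # evidence matrix: one column of class evidence per class

    # Scale both blocks so alpha mixes comparable quantities (an assumption).
    Xs = X / np.linalg.norm(X)
    Zs = Z / np.linalg.norm(Z)

    # Mixed Gram matrix, in the spirit of sample-space PCovR.
    K = alpha * (Xs @ Xs.T) + (1.0 - alpha) * (Zs @ Zs.T)

    # Leading eigenvectors, scaled by sqrt(eigenvalue), give the latent map T.
    evals, evecs = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:n_components]
    T = evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))
    return T, Z


# Small synthetic three-class example (made-up data, for illustration only).
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 20)
X_demo = rng.normal(size=(60, 4))
X_demo[y_demo == 1, 0] += 3.0
X_demo[y_demo == 2, 1] += 3.0

T_demo, Z_demo = pcovc_sketch(X_demo, y_demo, alpha=0.5)
```

Sweeping `alpha` between 0 and 1 in this sketch moves the map between a purely evidence-driven projection and a purely structure-driven one, which is the trade-off the mixing parameter controls.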

Paper Structure

This paper contains 25 sections, 10 equations, 9 figures.

Figures (9)

  • Figure 1: Relationship between variables and projectors in principal covariates classification. $\mathbf{Z}$ is the evidence matrix, which is a quantification of class likelihoods, and it can be a tensor in the case of a multilabel, multiclass classification problem.
  • Figure 2: PCovC maps will change based on the underlying classifier. PCovC ($\alpha=0.5$) maps for the Iris dataset (Fisher, 1936; Anderson, 1936) with ridge classification, logistic regression, support vector classification, and a single-layer perceptron. Color corresponds to the class (red, green, and blue for setosa, versicolor, and virginica flower classes) and the background shows the estimated decision boundaries.
  • Figure 3: Effect of mixing parameter $\alpha$ on the 2D map and classification performance (on an out-of-sample dataset) for the "tox21-ache-p3 (-1)" assay. White and blue points indicate "non-toxic" and "toxic" molecules, respectively. Opacity denotes the train (translucent) / test (opaque) split. Each panel contains a confusion matrix showing the accuracy of logistic regression on the testing data as they appear in the resulting map, where "TN" indicates the number of "true negatives", "FP" indicates the number of "false positives", and so on. For comparison, logistic regression on the full-dimensional data obtains a confusion matrix of TP=17, TN=505, FP=9, FN=19.
  • Figure 4: Molecules from the training set near the decision boundary that demonstrate delineating features of toxicity based on the "tox21-ache-p3 (-1)" assay. The underlying map corresponds to PCovC at $\alpha=0.05$; markers denote different pairs of molecules, with blue and white markers corresponding to toxic and non-toxic molecules. Toxic/non-toxic molecule pairs were chosen to correspond to the molecules closest in PCovC space.
  • Figure 5: Maps of the Tox21 dataset considering multiple assays for classification, two double-assay maps (left, center) and a triple-assay map (right). Each plot is given for the $\alpha$ that corresponds to the best classification accuracy for the corresponding assays. Marker color corresponds to the values for the three assays, shown in the table to the right. Opacity corresponds to the train/test split.
  • ...and 4 more figures
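
As a worked example of reading the confusion matrices in Figure 3, the short sketch below computes summary metrics from the full-dimensional baseline counts quoted in that caption (TP=17, TN=505, FP=9, FN=19). It assumes the "positive" class is "toxic"; the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, and recall from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # fraction of predicted positives that are correct
        "recall": tp / (tp + fn),     # fraction of actual positives that are found
    }


# Counts quoted in the Figure 3 caption for logistic regression
# on the full-dimensional data (positive class assumed to be "toxic").
baseline = binary_metrics(tp=17, tn=505, fp=9, fn=19)

for name, value in baseline.items():
    print(f"{name}: {value:.3f}" if name != "total" else f"{name}: {value}")
# total: 550
# accuracy: 0.949
# precision: 0.654
# recall: 0.472
```

The high accuracy alongside a recall below one half reflects the class imbalance in these counts (36 positives out of 550 test molecules), which is why the per-panel confusion matrices are more informative than accuracy alone.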