Enhancing Dimension-Reduced Scatter Plots with Class and Feature Centroids

Daniel B. Hier, Tayo Obafemi-Ajayi, Gayla R. Olbricht, Devin M. Burns, Sasha Petrenko, Donald C. Wunsch

TL;DR

Problem: interpreting the axes of dimension-reduced 2D scatter plots is challenging because the axes lack simple physical meaning. Approach: collapse the 790 HPO terms used by 235 neurogenetic disease variants into 31 phenotype categories by subsumption, apply t-SNE to reduce these 31 features to two dimensions, compute class centroids $(C_x, C_y)$ for three diseases and feature centroids $(F_x, F_y)$ for 31 phenotypes, and overlay these on the scatter plots; identify informative features with SHAP values from an XGBoost classifier. Findings: centroids provide a bridge to the original feature space, improving interpretability, and quadrant or centroid proximity analyses reveal relationships such as which features cluster with specific diseases. Impact: the method is simple to implement with matplotlib and can be applied to other high-dimensional biomedical datasets; data and code are publicly available on GitHub and Zenodo.

Abstract

Dimension reduction is increasingly applied to high-dimensional biomedical data to improve its interpretability. When a dataset is reduced to two dimensions, each observation is assigned x and y coordinates and is represented as a point on a scatter plot. A significant challenge lies in interpreting the meaning of the x and y axes, because the axes produced by dimension reduction do not correspond to any single original feature. This study addresses this challenge by using the x and y coordinates derived from dimension reduction to calculate class and feature centroids, which can be overlaid onto the scatter plots. This method connects the low-dimensional space to the original high-dimensional space. We illustrate the utility of this approach with data derived from the phenotypes of three neurogenetic diseases and demonstrate how the addition of class and feature centroids increases the interpretability of scatter plots.
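The centroid idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that a class centroid is the mean of the embedded coordinates of the observations in that class, and that a feature centroid is the mean of the coordinates of the observations in which a binary feature is present. The paper's own two equations are not reproduced in this section, so that definition, the function names, the class labels, and the toy data are all assumptions.

```python
import math
from collections import defaultdict


def class_centroids(points, labels):
    """Mean (x, y) of the embedded points belonging to each class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in zip(points, labels):
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def feature_centroids(points, features, feature_names):
    """Mean (x, y) of the points in which each binary feature is present.

    Assumption: features are 0/1 indicators, one row per observation.
    """
    centroids = {}
    for j, name in enumerate(feature_names):
        members = [p for p, row in zip(points, features) if row[j]]
        if members:  # skip features that never occur
            centroids[name] = (
                sum(x for x, _ in members) / len(members),
                sum(y for _, y in members) / len(members),
            )
    return centroids


def nearest_class(feature_xy, class_xy):
    """Name of the class whose centroid is closest (Euclidean) to a feature centroid."""
    return min(class_xy, key=lambda c: math.dist(feature_xy, class_xy[c]))


# Toy 2D embedding (hypothetical values standing in for t-SNE output).
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
labels = ["CA", "CA", "CMT", "CMT"]
feature_names = ["incoordination", "hyporeflexia"]
features = [[1, 0], [1, 0], [0, 1], [1, 1]]

cc = class_centroids(points, labels)
fc = feature_centroids(points, features, feature_names)
closest = {name: nearest_class(xy, cc) for name, xy in fc.items()}
```

To overlay the results on an existing scatter plot, the centroid coordinates could be drawn with matplotlib's `Axes.scatter` using a distinct marker and labeled with `Axes.annotate`. The `closest` mapping is one simple way to express a centroid proximity analysis: it reports which class centroid each feature centroid lies nearest to.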

Paper Structure (4 sections, 2 equations, 9 figures)

Figures (9)

  • Figure 1: Stacked bar chart of phenotypes (features) by disease type. Each column shows the frequency of one of the 31 available phenotype superclasses. The three most frequent phenotypes were incoordination, hyperreflexia, and eye movement abnormalities.
  • Figure 2: SHAP values used to identify the most important features predicting class membership. Bar length is proportional to a feature's influence on a class; for example, incoordination favored cerebellar ataxia, hyporeflexia favored Charcot-Marie-Tooth disease, and hypertonia favored hereditary spastic paraparesis. The top 10 features were retained for further analysis.
  • Figure 3: Summary of dimension reduction strategy. The Human Phenotype Ontology (HPO) has 8743 terms; the 235 neurogenetic disease cases used 790 of these terms; subsumption reduced the phenotypes to 31 categories; UMAP and t-SNE reduced the dimensions to 2.
  • Figure 4: Scatter plot of 235 neurogenetic disease variants in two-dimensional space. The x and y coordinates were calculated by t-SNE based on the 31 phenotype features.
  • Figure 5: Same scatter plot as Fig. 4 with markers colored by their ground truth labels. Note that markers form three groupings based on their class membership. There are 33 variants of Charcot-Marie-Tooth disease, 77 variants of hereditary spastic paraparesis, and 125 variants of cerebellar ataxia shown. Each variant has a distinct phenotype.
  • ...and 4 more figures
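
The subsumption step summarized in the Figure 3 caption, in which specific HPO terms are rolled up into a small set of phenotype categories, can be sketched as follows. This is an illustrative sketch under stated assumptions: the term names and child-to-parent links below are hypothetical placeholders rather than real HPO edges, each term is assumed to have a single parent (the real ontology allows several), and only two of the 31 categories are shown.

```python
# Hypothetical child -> parent links, for illustration only.
# These are NOT taken from the HPO or from the authors' data.
PARENT = {
    "gait ataxia": "ataxia",
    "limb ataxia": "ataxia",
    "ataxia": "incoordination",
    "areflexia": "hyporeflexia",
}

# Top-level categories at which the roll-up stops (two shown; the paper uses 31).
CATEGORIES = ["incoordination", "hyporeflexia"]


def roll_up(term):
    """Follow parent links until a category is reached; None if there is no path."""
    seen = set()
    while term not in CATEGORIES:
        if term in seen or term not in PARENT:
            return None  # unknown term or a cycle in the (hypothetical) links
        seen.add(term)
        term = PARENT[term]
    return term


def encode(case_terms):
    """Binary feature vector: 1 if any term of the case rolls up to the category."""
    present = {roll_up(t) for t in case_terms}
    return [1 if c in present else 0 for c in CATEGORIES]


vector = encode(["limb ataxia", "areflexia"])
```

Each disease variant then becomes one row of category indicators, which is the kind of input a method such as t-SNE or UMAP would reduce to two dimensions.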