Enhancing Dimension-Reduced Scatter Plots with Class and Feature Centroids
Daniel B. Hier, Tayo Obafemi-Ajayi, Gayla R. Olbricht, Devin M. Burns, Sasha Petrenko, Donald C. Wunsch
TL;DR
Problem: interpreting the axes of dimension-reduced 2D scatter plots is challenging because the axes lack simple physical meaning. Approach: apply t-SNE to reduce 970 HPO terms to 31 phenotype categories across 235 neurogenetic disease variants, compute class centroids $(C_x, C_y)$ for three diseases and feature centroids $(F_x, F_y)$ for 31 phenotypes, and overlay these on the scatter plots; identify informative features with SHAP values from an XGBoost classifier. Findings: centroids provide a bridge to the original feature space, improving interpretability, and quadrant or centroid proximity analyses reveal relationships such as which features cluster with specific diseases. Impact: the method is simple to implement with matplotlib and can be applied to other high-dimensional biomedical datasets; data and code are publicly available on GitHub and Zenodo.
Abstract
Dimension reduction is increasingly applied to high-dimensional biomedical data to improve its interpretability. When a dataset is reduced to two dimensions, each observation is assigned x and y coordinates and is represented as a point on a scatter plot. A significant challenge lies in interpreting the meaning of the x and y axes, which lack simple physical meaning after dimension reduction. This study addresses this challenge by using the x and y coordinates derived from dimension reduction to calculate class and feature centroids, which can be overlaid onto the scatter plots. This method connects the low-dimensional space to the original high-dimensional space. We illustrate the utility of this approach with data derived from the phenotypes of three neurogenetic diseases and demonstrate how the addition of class and feature centroids increases the interpretability of scatter plots.
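The centroid overlay described above can be sketched in a few lines of NumPy and matplotlib. This is a minimal illustration, not the authors' released code. It assumes a class centroid is the mean (x, y) of the points in that class, and a feature centroid is the mean (x, y) of all points weighted by their value for that feature. The function names, the toy data, and the weighting scheme are all assumptions made for this example.

```python
import numpy as np


def class_centroids(xy, labels):
    """Return {class: mean (x, y)} over the points belonging to each class."""
    labels = np.asarray(labels)
    return {c: xy[labels == c].mean(axis=0) for c in np.unique(labels)}


def feature_centroids(xy, features):
    """Return an (n_features, 2) array of feature-weighted mean coordinates.

    Assumed definition: F = sum_i(f_i * p_i) / sum_i(f_i), where f_i is the
    value of the feature for observation i and p_i is its (x, y) position.
    Features with zero total weight yield NaN rather than a division error.
    """
    weights = features.sum(axis=0)
    weights = np.where(weights == 0, np.nan, weights)
    return (features.T @ xy) / weights[:, None]


def plot_with_centroids(xy, labels, features, feature_names):
    """Overlay class and feature centroids on the 2D scatter plot."""
    import matplotlib.pyplot as plt  # imported here so the math runs without it

    labels = np.asarray(labels)
    fig, ax = plt.subplots()
    for c, (cx, cy) in class_centroids(xy, labels).items():
        pts = xy[labels == c]
        ax.scatter(pts[:, 0], pts[:, 1], s=15, alpha=0.5, label=str(c))
        ax.scatter(cx, cy, marker="X", s=150, edgecolor="black")
    for name, (fx, fy) in zip(feature_names, feature_centroids(xy, features)):
        ax.scatter(fx, fy, marker="^", color="black")
        ax.annotate(name, (fx, fy))
    ax.legend()
    return ax


# Toy stand-in for 2D coordinates produced by a dimension reduction step.
xy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
labels = ["A", "A", "B", "B"]
# Rows are observations, columns are two hypothetical features.
features = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 3.0]])

cc = class_centroids(xy, labels)
fc = feature_centroids(xy, features)
```

In this toy example the first feature is present only in class A, so its centroid coincides with the class A centroid, while the second feature is pulled toward the class B observation that carries the larger weight. Calling `plot_with_centroids(xy, labels, features, ["f0", "f1"])` draws the overlay.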
