Enhancing Supervised Visualization through Autoencoder and Random Forest Proximities for Out-of-Sample Extension

Shuang Ni, Adrien Aumon, Guy Wolf, Kevin R. Moon, Jake S. Rhodes

TL;DR

The paper tackles the lack of out-of-sample extension in RF-PHATE by coupling geometry-regularized autoencoders with random forest proximities. It introduces RF-PRN architectures that either reconstruct proximities or leverage proximity information to regularize the latent space, guided by the RF-PHATE embedding $G$ through a loss $L = L_{recon} + \lambda L_{geom}$. Proximity-reconstruction approaches, especially RF-PRN and RF-PRN-PRO, yield embeddings that closely preserve the original RF-PHATE structure, as validated by Mantel correlations, and RF-PRN-PRO cuts training time by about 40% via proximity prototypes while maintaining quality. Notably, the method supports semi-supervised extension: it requires no labels for out-of-sample points and performs consistently even with only 10% of the training data, which makes it practical for large datasets. The results provide a practical pathway to extend supervised manifold embeddings efficiently and robustly in real-world visualization tasks.
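
To make the combined loss concrete, here is a minimal NumPy sketch of $L = L_{recon} + \lambda L_{geom}$. It assumes both terms are plain mean squared errors (the Figure 1 caption describes them as MSE, but the exact normalization is an assumption here), and the function and variable names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def mse(a, b):
    """Mean squared error over all entries of two equally shaped arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def combined_loss(prox, prox_hat, latent, rfphate_embedding, lam):
    """Return (L, L_recon, L_geom) with L = L_recon + lam * L_geom.

    Assumption: L_recon is the MSE between input and reconstructed
    proximities, and L_geom is the MSE between the latent codes and
    the RF-PHATE embedding G.
    """
    l_recon = mse(prox, prox_hat)
    l_geom = mse(latent, rfphate_embedding)
    return l_recon + lam * l_geom, l_recon, l_geom

# Toy values, chosen only to make the arithmetic easy to follow.
prox = np.array([[1.0, 0.0], [0.0, 1.0]])       # input proximities
prox_hat = np.array([[0.5, 0.0], [0.0, 0.5]])   # decoder output
latent = np.array([[0.0, 0.0], [1.0, 1.0]])     # bottleneck codes
G = np.array([[0.0, 0.0], [1.0, 3.0]])          # RF-PHATE embedding

total, l_recon, l_geom = combined_loss(prox, prox_hat, latent, G, lam=10.0)
print(total, l_recon, l_geom)
```

With a large $\lambda$ the geometric term dominates, so the latent codes are pulled toward $G$ even when reconstruction is imperfect.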

Abstract

The value of supervised dimensionality reduction lies in its ability to uncover meaningful connections between data features and labels. Common dimensionality reduction methods embed a set of fixed, latent points, but are not capable of generalizing to an unseen test set. In this paper, we provide an out-of-sample extension method for the random forest-based supervised dimensionality reduction method, RF-PHATE, combining information learned from the random forest model with the function-learning capabilities of autoencoders. Through quantitative assessment of various autoencoder architectures, we identify that networks that reconstruct random forest proximities are more robust for the embedding extension problem. Furthermore, by leveraging proximity-based prototypes, we achieve a 40% reduction in training time without compromising extension quality. Our method does not require label information for out-of-sample points, thus serving as a semi-supervised method, and can achieve consistent quality using only 10% of the training data.
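
The label-free extension claim can be illustrated with a small sketch. It uses the classic leaf co-occurrence proximity (the fraction of trees in which two points land in the same leaf) as a simpler stand-in for the RF-GAP proximities named in the Figure 1 caption, so it is not the paper's exact definition. The input is assumed to be a matrix of leaf indices with one row per point and one column per tree, which is the shape scikit-learn's `RandomForestClassifier.apply` returns; the toy leaf indices below are made up.

```python
import numpy as np

def leaf_proximity(leaves_a, leaves_b):
    """Fraction of trees in which rows of leaves_a and leaves_b share a leaf.

    leaves_a: integer array of shape (n_a, n_trees) of leaf indices.
    leaves_b: integer array of shape (n_b, n_trees) of leaf indices.
    Returns an (n_a, n_b) matrix with entries in [0, 1].
    """
    leaves_a = np.asarray(leaves_a)
    leaves_b = np.asarray(leaves_b)
    # Compare leaf ids tree by tree, then average over the tree axis.
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    return same_leaf.mean(axis=2)

# Hypothetical leaf indices for 3 training points in a 3-tree forest.
train_leaves = np.array([[3, 7, 2],
                         [3, 5, 2],
                         [4, 5, 9]])
# A new, unlabeled point only needs to be routed through the trees.
new_leaves = np.array([[3, 5, 9]])

prox_train = leaf_proximity(train_leaves, train_leaves)
prox_new = leaf_proximity(new_leaves, train_leaves)
print(prox_new)
```

Labels are used only when the forest is trained; computing `prox_new` requires the new point's features alone, which is why the extension can be applied to unlabeled out-of-sample data.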

Paper Structure (7 sections, 2 equations, 3 figures, 2 tables)

This paper contains 7 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A visual depiction of the RF-PRN-PRO architecture. Using prototypical RF-GAP proximities from the training sample, the loss function combines MSE reconstruction loss $L_{recon}$ and a geometric regularization term $L_{geom}$ involving MSE between latent representation and RF-PHATE embeddings. For out-of-sample extension, this architecture incorporates new data and proximities, using the latent representation as extended embeddings.
  • Figure 2: (Left) Mean Mantel correlations, grouped by model architecture and the regularization parameter $\lambda$. Proximity-reconstructing networks (RF-PRN and RF-PRN-PRO) tend to be more robust to the choice of $\lambda$, while the other networks give similar performance for higher $\lambda$ values. (Right) Categorical plots for each model architecture indicate that RF-PRN and RF-PRN-PRO generally produce embeddings closer to the original RF-PHATE embeddings.
  • Figure 3: Evaluation of architecture performance ($\lambda = 10$) across different training data percentages using the Fashion-MNIST dataset (Xiao et al., 2017), with 10 repetitions. The metrics include Mantel correlation, training time, and MSE. The training time percentage is calculated as the ratio of each architecture's time to the maximum time. RF-PRN and RF-PRN-PRO consistently provide high-quality embeddings across different percentages of training data used. Although the training time for RF-PRN is higher than the other models, RF-PRN-PRO improves training time while maintaining embedding quality.
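
The Mantel correlation used in Figures 2 and 3 can be sketched as the Pearson correlation between the off-diagonal entries of two pairwise distance matrices. The version below assumes Euclidean distances in both embeddings and omits the permutation test that usually accompanies a Mantel statistic; the helper names and toy points are illustrative, not the authors' evaluation code.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix for an (n_points, n_dims) array."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def mantel_correlation(dist_a, dist_b):
    """Pearson correlation between the upper triangles of two distance matrices."""
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    # Each unordered pair is counted once; the zero diagonal is skipped.
    iu = np.triu_indices_from(dist_a, k=1)
    return float(np.corrcoef(dist_a[iu], dist_b[iu])[0, 1])

# Toy check: a scaled and shifted copy preserves relative distances.
reference = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
extended = 2.0 * reference + 5.0

r = mantel_correlation(pairwise_distances(reference),
                       pairwise_distances(extended))
print(r)
```

Because the statistic depends only on pairwise distances, it is unaffected by translating or uniformly rescaling an embedding, which makes it a reasonable way to compare an extended embedding against a reference one.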