Assessing and improving reliability of neighbor embedding methods: a map-continuity perspective
Zhexuan Liu, Rong Ma, Yiqiao Zhong
TL;DR
This work addresses the reliability of neighbor-embedding visualizations (e.g., t-SNE, UMAP) by using leave-one-out (LOO) analysis to define a data-independent embedding map on the entire input space. It introduces the LOO-map; two forms of map discontinuity, overconfidence-inducing (OI), which exaggerates cluster separation, and fracture-inducing (FI), which creates spurious local structures; and two label-free, point-wise scores (perturbation and singularity) that diagnose global and local distortions and guide hyperparameter tuning. Experiments on synthetic and real datasets show that the LOO assumption holds and that the proposed scores detect topology changes and inform the choice of perplexity, improving interpretability in biomedical and computer-vision applications. The approach yields a practical, wrapper-based toolkit (with an accompanying R package) for producing more faithful visualizations of high-dimensional data and more robust hyperparameter selection.
Abstract
Visualizing high-dimensional data is essential for understanding biomedical data and deep learning models. Neighbor embedding methods, such as t-SNE and UMAP, are widely used but can introduce misleading visual artifacts. We find that the manifold learning interpretations from many prior works are inaccurate and that the misuse stems from a lack of data-independent notions of embedding maps, which project high-dimensional data into a lower-dimensional space. Leveraging the leave-one-out principle, we introduce LOO-map, a framework that extends embedding maps beyond discrete points to the entire input space. We identify two forms of map discontinuity that distort visualizations: one exaggerates cluster separation and the other creates spurious local structures. As a remedy, we develop two types of point-wise diagnostic scores to detect unreliable embedding points and improve hyperparameter selection, which are validated on datasets from computer vision and single-cell omics.
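To make the leave-one-out idea concrete, the sketch below embeds a new input point while holding an existing embedding fixed, which is one way to read "extending the embedding map beyond discrete points". It is a simplified illustration under stated assumptions, not the paper's exact LOO loss: the names `input_affinities` and `loo_map` are hypothetical, the input affinities use a fixed Gaussian bandwidth `sigma` instead of a perplexity search, the low-dimensional similarities are normalized over the new point's neighbors only, and the fixed embedding `Y` is hand-made toy data standing in for a real t-SNE output.

```python
import numpy as np


def input_affinities(x, X, sigma=1.0):
    """Gaussian affinities of a new input x to the rows of X, normalized to sum to 1.

    Assumption: a fixed bandwidth sigma replaces t-SNE's perplexity calibration.
    """
    sq_dists = np.sum((X - x) ** 2, axis=1)
    # Subtract the minimum before exponentiating for numerical stability.
    p = np.exp(-(sq_dists - sq_dists.min()) / (2.0 * sigma ** 2))
    return p / p.sum()


def loo_map(x, X, Y, sigma=1.0, lr=0.1, n_steps=500):
    """Embed x by gradient descent on KL(p || q) with the embedding Y held fixed.

    q_i is a Student-t similarity between the candidate location y and Y[i],
    normalized over i only (a simplification of the full t-SNE normalization).
    """
    p = input_affinities(x, X, sigma)
    y = p @ Y  # initialize at the affinity-weighted mean of the fixed embedding
    for _ in range(n_steps):
        diff = y - Y                                   # shape (n, 2)
        w = 1.0 / (1.0 + np.sum(diff ** 2, axis=1))    # Student-t kernel
        q = w / w.sum()
        grad = 2.0 * ((p - q) * w) @ diff              # gradient of KL(p || q) in y
        y = y - lr * grad
    return y


# Toy data: two well-separated clusters in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 5)), rng.normal(6.0, 1.0, (20, 5))])
# Hand-made stand-in for an existing 2-D embedding of X (not a real t-SNE run).
Y = np.vstack([rng.normal([-5.0, 0.0], 0.5, (20, 2)),
               rng.normal([5.0, 0.0], 0.5, (20, 2))])

y_a = loo_map(np.zeros(5), X, Y)      # new input at the center of the first cluster
y_b = loo_map(np.full(5, 6.0), X, Y)  # new input at the center of the second cluster
```

Because `loo_map` accepts any input `x`, it can be evaluated along a path between clusters or at small perturbations of a data point, which is the kind of probing that point-wise diagnostics of map continuity rely on. With this simplified loss the behavior between clusters need not match that of the full t-SNE or UMAP objectives.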
