Preserving clusters and correlations: a dimensionality reduction method for exceptionally high global structure preservation

Jacob Gildenblat, Jens Pahnke

TL;DR

PCC introduces a global correlation objective to dimensionality reduction, explicitly aiming to preserve the global arrangement of data by aligning high- and low-dimensional distances to reference points. It couples this with a local structure objective based on cluster separability, learned via a linear classifier on the low-dimensional embedding. The method achieves state-of-the-art GS preservation, outperforms many existing DR approaches on multiple datasets, and can augment UMAP through PCUMAP or via initialization strategies, with demonstrated benefits in medical imaging contexts. The work provides a practical, simple framework for improving global fidelity in DR while maintaining competitive local clustering behavior, with clear implications for visualization and downstream analysis in life sciences and imaging.
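
As an illustration of the global correlation idea, the following minimal sketch computes a Pearson-based loss between distances to reference points in the high- and low-dimensional spaces. It is not the authors' implementation: the function names (`correlation_loss`, `pearson`, `euclidean`), the random choice of reference points, and the `1 - correlation` form of the loss are assumptions made for illustration, and the Spearman approximation is not covered.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(xs, ys):
    # Plain Pearson correlation coefficient of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def correlation_loss(high, low, ref_idx):
    """Hypothetical loss: 1 - mean Pearson correlation between the
    high- and low-dimensional distances to each reference point."""
    total = 0.0
    for r in ref_idx:
        d_high = [euclidean(p, high[r]) for p in high]
        d_low = [euclidean(q, low[r]) for q in low]
        total += pearson(d_high, d_low)
    return 1.0 - total / len(ref_idx)

rng = random.Random(0)
# Toy data: 2-D points padded with zeros to 5 dimensions.
plane = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
high = [(x, y, 0.0, 0.0, 0.0) for x, y in plane]

# A uniformly scaled embedding keeps all distance ratios intact.
scaled = [(3 * x, 3 * y) for x, y in plane]
# Shuffling the rows destroys the correspondence between the two spaces.
shuffled = scaled[:]
rng.shuffle(shuffled)

refs = rng.sample(range(200), 8)  # assumed: random reference points
loss_scaled = correlation_loss(high, scaled, refs)
loss_shuffled = correlation_loss(high, shuffled, refs)
```

In this toy setting the scaled embedding yields a loss near zero, since Pearson correlation is invariant to uniform scaling of the distances, while the shuffled embedding yields a loss near one.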

Abstract

We present Preserving Clusters and Correlations (PCC), a novel dimensionality reduction (DR) method that achieves state-of-the-art global structure (GS) preservation while maintaining competitive local structure (LS) preservation. It optimizes two objectives: a GS preservation objective that preserves an approximation of Pearson and Spearman correlations between high- and low-dimensional distances, and an LS preservation objective that ensures clusters in the high-dimensional data are separable in the low-dimensional data. In addition, we show the correlation objective can be combined with UMAP to significantly improve its GS preservation with minimal degradation of the LS. We quantitatively benchmark PCC against existing methods, demonstrate its utility in medical imaging, and show that it is a competitive DR technique with superior GS preservation in our benchmarks.
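
The local structure objective can be pictured with a small sketch: given cluster labels for the high-dimensional data, a linear softmax classifier is fit on the low-dimensional embedding, and its cross-entropy indicates how separable the clusters are. Everything here is an assumption made for illustration: the labels are taken as given (they might come from a clustering step such as k-means), the helper names (`classifier_loss`, `predict`, `softmax`) are hypothetical, and only the classifier is trained on a fixed embedding, whereas the method itself presumably optimizes the embedding as well.

```python
import math
import random

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, x):
    # Class probabilities of a linear classifier for one embedded point.
    return softmax([sum(w * v for w, v in zip(row, x)) + bias
                    for row, bias in zip(W, b)])

def classifier_loss(points, labels, n_classes, steps=500, lr=0.2):
    """Fit a linear softmax classifier by full-batch gradient descent
    and return its final mean cross-entropy on the same points."""
    dim, n = len(points[0]), len(points)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(steps):
        grad_W = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, y in zip(points, labels):
            probs = predict(W, b, x)
            for k in range(n_classes):
                err = (probs[k] - (1.0 if k == y else 0.0)) / n
                grad_b[k] += err
                for d in range(dim):
                    grad_W[k][d] += err * x[d]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k]
            for d in range(dim):
                W[k][d] -= lr * grad_W[k][d]
    return -sum(math.log(max(predict(W, b, x)[y], 1e-12))
                for x, y in zip(points, labels)) / n

rng = random.Random(0)
labels = [i % 3 for i in range(90)]  # assumed cluster labels
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]

# Embedding where each cluster occupies its own region.
separated = [(centers[y][0] + rng.gauss(0, 0.3),
              centers[y][1] + rng.gauss(0, 0.3)) for y in labels]
# Embedding where cluster membership is unrelated to position.
mixed = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in labels]

loss_separated = classifier_loss(separated, labels, 3)
loss_mixed = classifier_loss(mixed, labels, 3)
```

The separated embedding ends with a much lower cross-entropy than the mixed one, which is the signal such an objective would push the embedding toward.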

Paper Structure

This paper contains 19 sections, 12 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Results on Fashion MNIST. In PCC, unlike UMAP, distances between different points are meaningful since GS is preserved. Unlike PCA, clusters are separated because of the higher LS. With a larger number of clusters, such as 256, we get isolated groups of points belonging to those clusters.
  • Figure 2: Comparing UMAP (an existing method) with PCUMAP and PCC (our proposed methods) on the Macosko single-cell dataset (Macosko et al., 2015). Upper row: the transformed data colored by labels. Bottom row: the transformed data colored by distance from a selected point in the high-dimensional data. In UMAP, the low-dimensional points do not preserve the original distances: many points that are far from the selected point in the embedding are close to it in the high-dimensional data. PCUMAP and PCC resolve this.
  • Figure 3: Average GS metrics plotted against LS metrics across 9 datasets. Our proposed methods: PCC, UMAP init+PC, PCUMAP. PCC improves GS preservation over all other tested methods by a large margin (average of 0.83, while PCA gets 0.71 and UMAP 0.44), while being competitive in LS preservation with graph methods that specialize in local structure (e.g., PCC gets 0.933 and UMAP 0.94). Among the modern graph methods, PaCMAP performs best and slightly improves the global structure compared to UMAP. However, there is still room for improvement in GS preservation, which we show is possible.
  • Figure 4: Comparing UMAP and PCC using visualizations of lipidomics MSI data of mouse brain. Both methods reduce the high-dimensional image data to 3 dimensions, which are then normalized and colored as RGB images. PCC reveals numerous pathological changes, so-called $\beta$-amyloid plaques (circle and arrow, brown dots), in the isocortex (CTX) of the Alzheimer's disease mouse model, and also enables visualization of normal structures, e.g. the neuronal band of the dentate gyrus (DG, arrow, band) in the hippocampus formation (HPF). Both structures are undetectable in the UMAP visualization (left images). Legend: CTX - isocortex of the cerebrum, HPF - hippocampus formation, DG - dentate gyrus, CA1 and CA2 - cornu ammonis neurons, area 1 and 2, WM - white matter, BG - background.
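
The RGB rendering described for Figure 4 can be sketched as follows: a 3-dimensional embedding with one row per pixel is normalized per channel and reshaped into an image. The caption only states that the dimensions are "normalized", so the per-channel min-max scaling to 0..255 used here is an assumption, and the helper names `embedding_to_rgb` and `to_image` are hypothetical.

```python
def embedding_to_rgb(embedding):
    """Min-max normalize each of the 3 embedding axes to integer
    channel values in 0..255 (assumed normalization scheme)."""
    channels = list(zip(*embedding))
    lows = [min(c) for c in channels]
    highs = [max(c) for c in channels]
    rgb = []
    for point in embedding:
        pixel = []
        for value, lo, hi in zip(point, lows, highs):
            span = hi - lo
            # A constant channel carries no information; map it to 0.
            scaled = (value - lo) / span if span > 0 else 0.0
            pixel.append(round(255 * scaled))
        rgb.append(tuple(pixel))
    return rgb

def to_image(rgb, height, width):
    # Reshape the flat, row-major pixel list into rows of an image.
    return [rgb[row * width:(row + 1) * width] for row in range(height)]

# Toy 2x2 "image": one 3-D embedding vector per pixel, row-major order.
embedding = [(-1.0, 0.0, 2.0), (1.0, 10.0, 2.0),
             (-1.0, 10.0, 2.0), (1.0, 0.0, 2.0)]
image = to_image(embedding_to_rgb(embedding), height=2, width=2)
```

Because each channel is scaled independently, colors are comparable within an image but not necessarily between images produced by different methods.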