Cluster-based multidimensional scaling embedding tool for data visualization

Patricia Hernández-León, Miguel A. Caro

TL;DR

The paper addresses the challenge of visualizing high-dimensional data by preserving local and global structures in a single $2$-D embedding. It introduces cluster MDS (cl-MDS), which first computes $N_\text{cl}$ local MDS embeddings on $k$-medoids clusters, then selects up to four anchor points per cluster to define a global anchor map via MDS, and finally merges the two via per-cluster affine or projective transformations. The approach is enhanced with a hierarchical embedding option and sparsification to scale to very large datasets, including atomic-structure datasets using SOAP descriptors. Demonstrations on CHO, QM9, and PtAu nanoparticle data show improved visualization of multi-scale locality and meaningful medoid-based interpretation compared to standard methods such as PCA, Isomap, t-SNE, and UMAP, with practical benefits for materials science and chemistry workloads.
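
The pipeline summarized above can be sketched in a few dozen lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: classical (Torgerson) MDS stands in for the MDS solver, the k-medoids routine is a plain alternating scheme, anchors are picked by a farthest-point heuristic (the summary only says "up to four" per cluster), and only the affine variant of the per-cluster transformation is shown. All function names (`classical_mds`, `k_medoids`, `pick_anchors`, `cl_mds_sketch`) are hypothetical.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS; an assumed stand-in for the paper's MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:dim]              # largest eigenvalues first
    Y = V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))
    if Y.shape[1] < dim:                         # tiny clusters: pad with zeros
        Y = np.hstack([Y, np.zeros((n, dim - Y.shape[1]))])
    return Y

def k_medoids(D, k, n_iter=50, seed=0):
    """Plain alternating k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=0)
                new[c] = members[np.argmin(within)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)

def pick_anchors(D, members, n_anchors):
    """Farthest-point traversal: a heuristic assumption, not the paper's rule."""
    sub = D[np.ix_(members, members)]
    chosen = [int(np.argmax(sub.sum(axis=1)))]   # start from a peripheral point
    while len(chosen) < min(n_anchors, len(members)):
        chosen.append(int(np.argmax(sub[:, chosen].min(axis=1))))
    return members[np.array(chosen)]

def cl_mds_sketch(X, n_clusters, n_anchors=4, seed=0):
    """Local MDS per cluster, global MDS of anchors, affine merge."""
    D = pairwise_dist(X)                         # assumes all points are distinct
    medoids, labels = k_medoids(D, n_clusters, seed=seed)
    local, anchors = [], []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        local.append((members, classical_mds(D[np.ix_(members, members)])))
        anchors.append(pick_anchors(D, members, n_anchors))
    all_anchors = np.concatenate(anchors)
    anchor_map = classical_mds(D[np.ix_(all_anchors, all_anchors)])
    Y = np.zeros((X.shape[0], 2))
    start = 0
    for (members, Y_loc), a in zip(local, anchors):
        pos = np.searchsorted(members, a)        # anchor rows inside the cluster
        src = np.hstack([Y_loc[pos], np.ones((len(a), 1))])
        dst = anchor_map[start:start + len(a)]
        A, *_ = np.linalg.lstsq(src, dst, rcond=None)   # least-squares affine fit
        Y[members] = np.hstack([Y_loc, np.ones((len(members), 1))]) @ A
        start += len(a)
    return Y, labels, medoids

# Toy usage: three well-separated Gaussian blobs in 10 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(60, 10)) for c in centers])
Y, labels, medoids = cl_mds_sketch(X, n_clusters=3)
print(Y.shape)
```

With four anchors per cluster the affine fit is overdetermined (three unknowns per output coordinate), so the least-squares solution places the anchors only approximately; the projective transformation mentioned above has more degrees of freedom and is not covered by this sketch.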

Abstract

We present a new technique for visualizing high-dimensional data called cluster MDS (cl-MDS), which addresses a common difficulty of dimensionality reduction methods: preserving both local and global structures of the original sample in a single 2-dimensional visualization. Its algorithm combines the well-known multidimensional scaling (MDS) tool with the $k$-medoids data clustering technique, and enables hierarchical embedding, sparsification and estimation of 2-dimensional coordinates for additional points. While cl-MDS is a generally applicable tool, we also include specific recipes for atomic structure applications. We apply this method to non-linear data of increasing complexity where different layers of locality are relevant, showing a clear improvement in their retrieval and visualization quality.

Paper Structure (14 sections, 8 equations, 9 figures)

Figures (9)

  • Figure 1: Steps of the cl-MDS algorithm: (1) $k$-medoids clustering of the data; (2) MDS-based local embedding of the individual clusters; (3) anchor-point selection within the individual clusters; (4) MDS-based global embedding of the anchor points only; (5,6) global embedding of all data points based on transformations derived from (2,4).
  • Figure 2: Illustration of the hierarchical cluster setup, using a $[5,2,1]$ hierarchy. The figure includes the set of anchor points obtained at each clustering level, following step (3). The last level does not require such a set, since it corresponds to the final embedding.
  • Figure 3: Effect of the number of clusters $N_\text{cl}$ in cl-MDS embedding and performance for a simple example.
  • Figure 4: Comparison of several dimensionality reduction techniques applied to an S-curve manifold with 1000 points. As in the original example (S_example), we used minimal parameters whenever possible. The top two rows include those methods that require a fixed number of neighbors, in this case limited to 15.
  • Figure 5: Simple example of cl-MDS applied to a high-dimensional dataset. The sample consists of 1000 points distributed in $\mathbb{R}^2$ as shown on panel (a), where $N_h = 12$. Their high-dimensional representation is obtained from the pairwise distances to each hole, illustrated on panel (b). Panels (c) and (d) show the cl-MDS embedding and the Voronoi diagram of the medoids, respectively.
  • ...and 4 more figures
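
The dataset described in the Figure 5 caption can be imitated with the short sketch below. The caption gives only the sample size (1000 points in $\mathbb{R}^2$) and $N_h = 12$; the hole positions, the hole radius, the sampling domain, the exclusion of points inside holes, and the reading of "distance to each hole" as distance to the hole center are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_holes = 1000, 12          # sizes quoted in the Figure 5 caption
hole_radius = 0.05                    # hypothetical; not given in the caption

# Hypothetical hole centers in the unit square.
holes = rng.uniform(0.0, 1.0, size=(n_holes, 2))

# Draw candidates and keep the first n_points lying outside every hole.
candidates = rng.uniform(0.0, 1.0, size=(4 * n_points, 2))
dist = np.linalg.norm(candidates[:, None, :] - holes[None, :, :], axis=-1)
keep = np.flatnonzero(dist.min(axis=1) > hole_radius)[:n_points]

points = candidates[keep]             # 2-D sample, in the spirit of panel (a)
features = dist[keep]                 # one distance per hole, as in panel (b)

print(points.shape, features.shape)
```

Each row of `features` is a 12-dimensional descriptor of one 2-D point, which is the kind of high-dimensional input an embedding method would then have to map back to the plane.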