Enhanced High-Dimensional Data Visualization through Adaptive Multi-Scale Manifold Embedding

Tianhao Ni, Bingjie Li, Zhigang Yao

TL;DR

AMSME tackles high-dimensional data visualization by replacing absolute distances with ordinal rankings $o(x_i; x_j)$ and employing an adaptive multi-scale neighborhood to build a similarity graph. The method performs a two-stage nonlinear embedding, producing an initial layout $Y_1$ with pseudo-labels and a final layout $Y_2$ with enhanced inter-cluster separation via a label-driven reweighting of the distance matrix. Theoretical results show ordinal distances remain discriminative in high dimensions, and experiments on image and text datasets show consistent improvements over $t$-SNE, UMAP, and PaCMAP, with substantial gains in clustering accuracy and topology preservation. The approach also demonstrates multi-resolution analysis in scRNA-seq data, uncovering novel neuronal subtypes and associated marker genes, highlighting AMSME's practical impact for biology and beyond.

Abstract

To address the dual challenges of the curse of dimensionality and the difficulty in separating intra-cluster and inter-cluster structures in high-dimensional manifold embedding, we propose an Adaptive Multi-Scale Manifold Embedding (AMSME) algorithm. By introducing ordinal distance to replace traditional Euclidean distances, we theoretically demonstrate that ordinal distance overcomes the constraints of the curse of dimensionality in high-dimensional spaces, effectively distinguishing heterogeneous samples. We design an adaptive neighborhood adjustment method to construct similarity graphs that simultaneously balance intra-cluster compactness and inter-cluster separability. Furthermore, we develop a two-stage embedding framework: the first stage achieves preliminary cluster separation while preserving connectivity between structurally similar clusters via the similarity graph, and the second stage enhances inter-cluster separation through a label-driven distance reweighting. Experimental results demonstrate that AMSME significantly preserves intra-cluster topological structures and improves inter-cluster separation on real-world datasets. Additionally, leveraging its multi-resolution analysis capability, AMSME discovers novel neuronal subtypes in the mouse lumbar dorsal root ganglion scRNA-seq dataset, with marker gene analysis revealing their distinct biological roles.
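
The idea of replacing absolute distances with ordinal rankings can be illustrated with a minimal sketch. The exact definition of $o(x_i; x_j)$ is not given in this summary, so the code below assumes one plausible reading: the rank of $x_j$ among all points sorted by Euclidean distance from $x_i$. The function name `ordinal_distance` is hypothetical and is not taken from the paper.

```python
import numpy as np

def ordinal_distance(X):
    """Assumed reading of o(x_i; x_j): rank of x_j when all points are
    sorted by Euclidean distance from x_i (x_i itself gets rank 0)."""
    # Pairwise Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # order[i] lists point indices from nearest to farthest from x_i.
    order = np.argsort(D, axis=1, kind="stable")
    n = X.shape[0]
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    # Invert the permutation: the point at sorted position p gets rank p.
    ranks[rows, order] = np.arange(n)[None, :]
    return ranks

X = np.array([[0.0], [1.0], [3.0], [3.5]])
R = ordinal_distance(X)
print(R)
```

Under this reading the ranks depend only on the ordering of distances, so rescaling the data leaves them unchanged, and the matrix is generally not symmetric (here `R[0, 2]` and `R[2, 0]` differ).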

Paper Structure

This paper contains 12 sections, 2 theorems, 30 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Let $x_i, x_j \sim \mathcal{N}(\mu_1, \sigma_1^2 I_d)$ be independently and identically distributed for $i \neq j$, and $y_k \sim \mathcal{N}(\mu_2, \sigma_2^2 I_d)$. If the global separability condition $\sigma_2^2 - \sigma_1^2 + \|\mu_1 - \mu_2\|^2 > 0$ holds, define the intra-cluster squared dist
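
The theorem statement is truncated in this summary, so its conclusion is not reproduced here. As a rough illustration of the setting only, the following Monte Carlo sketch draws $x_i, x_j$ and $y_k$ from the two Gaussians and estimates how often the intra-cluster squared distance $\|x_i - x_j\|^2$ exceeds the inter-cluster squared distance $\|x_i - y_k\|^2$. The function name, the choice $\mu_1 = 0$, and the placement of the mean gap on the first coordinate are assumptions made for the example.

```python
import numpy as np

def prob_intra_exceeds_inter(d, mu_gap, sigma1, sigma2, n_trials=5000, seed=0):
    """Estimate P(||x_i - x_j||^2 > ||x_i - y_k||^2) by simulation.

    Assumptions for this sketch: mu_1 = 0 and mu_2 = (mu_gap, 0, ..., 0).
    For reference, the expected squared distances are
      E||x_i - x_j||^2 = 2 * sigma1^2 * d
      E||x_i - y_k||^2 = (sigma1^2 + sigma2^2) * d + mu_gap^2
    """
    rng = np.random.default_rng(seed)
    mu2 = np.zeros(d)
    mu2[0] = mu_gap
    xi = rng.normal(0.0, sigma1, size=(n_trials, d))
    xj = rng.normal(0.0, sigma1, size=(n_trials, d))
    yk = mu2 + rng.normal(0.0, sigma2, size=(n_trials, d))
    intra = ((xi - xj) ** 2).sum(axis=1)
    inter = ((xi - yk) ** 2).sum(axis=1)
    return float((intra > inter).mean())

# Equal variances and a fixed mean gap: raw distances lose contrast as d grows.
for d in (2, 20, 200):
    print(d, prob_intra_exceeds_inter(d, mu_gap=1.0, sigma1=1.0, sigma2=1.0))
```

With equal variances and a fixed mean gap, the estimated probability drifts toward 0.5 as $d$ increases, which is the loss of contrast in raw distances that motivates the ordinal alternative.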

Figures (6)

  • Figure 1: Overview of the AMSME framework (see Algorithm 1 for details). First, AMSME acquires the input data's distance matrix, then constructs an ordinal distance to overcome the curse of dimensionality. Subsequently, it adaptively selects neighborhood sizes based on density variations and builds a similarity graph to weaken inter-cluster similarities while enhancing intra-cluster cohesion. Based on this graph, AMSME performs the first visualization and obtains pseudo-labels via pre-clustering. Using these labels, it amplifies inter-cluster discrepancies in the distance matrix and conducts a second visualization with the updated matrix to achieve distinct inter-cluster separation.
  • Figure 2: Probability of intra-cluster distance exceeding inter-cluster distance, based on 10 repeated trials.
  • Figure 3: The comparison of results from the three-step similarity graphs.
  • Figure 4: Manifold embedding results for five datasets using five methods.
  • Figure 5: Bar chart of ACC results for three clustering algorithms applied to visualization results on five datasets.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 1
  • Proof 1
  • Theorem 2
  • Proof 2: Proof of Theorem 2