The Shape of Attraction in UMAP: Exploring the Embedding Forces in Dimensionality Reduction
Mohammad Tariqul Islam, Jason W. Fleischer
TL;DR
The paper analyzes how attraction and repulsion forces govern UMAP embeddings by decomposing the low-dimensional updates into an attraction shape $f_a(\zeta)$ and a repulsion shape $f_r(\zeta)$, where $\zeta=\|y_i-y_j\|_2$ is the distance between embedded points $y_i$ and $y_j$. It develops analytic results (including contraction/expansion conditions) and derives the UMAP-specific shapes, showing that attraction can cause both contraction and expansion, that learning-rate annealing helps tighten clusters into compact boundaries, and that repulsion modulates inter-cluster distances. It demonstrates that modifying the attraction shape improves consistency under random initialization (measured via Procrustes analysis on MNIST and other datasets) and analyzes how the attraction and repulsion shapes interact to form clusters, comparing UMAP with NEG-$t$-SNE, PaCMAP, TriMap, and LocalMAP. The findings offer a principled lens for interpreting and improving DR algorithms, suggesting that shape-mixing and far-distance attraction can enhance robustness and accuracy, with implications for contrastive and representation-learning contexts.
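To make the Procrustes-based consistency check concrete, the following is a minimal sketch of how two embeddings of the same points could be compared. It uses `scipy.spatial.procrustes`, which removes translation, scale, and rotation/reflection before reporting a disparity. The helper name `embedding_disparity` and the synthetic arrays are illustrative assumptions, not the paper's protocol or data.

```python
import numpy as np
from scipy.spatial import procrustes


def embedding_disparity(emb_a, emb_b):
    """Procrustes disparity between two embeddings whose rows are the same points.

    Values near 0 mean the layouts agree up to shift, scale, and rotation/reflection.
    """
    _, _, disparity = procrustes(emb_a, emb_b)
    return disparity


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))  # stand-in for a 2-D embedding (synthetic)

# Same layout, but rotated by 60 degrees, scaled by 3, and shifted by 5.
theta = np.pi / 3
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
same_shape = 3.0 * base @ rot.T + 5.0

# An unrelated layout of the same size.
unrelated = rng.normal(size=(200, 2))

d_same = embedding_disparity(base, same_shape)
d_diff = embedding_disparity(base, unrelated)
print(f"same layout: {d_same:.3e}, unrelated layout: {d_diff:.3f}")
```

In practice, `base` and the second array would be embeddings of one dataset produced from different random initializations; a lower disparity would then indicate more consistent cluster placement.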
Abstract
Uniform manifold approximation and projection (UMAP) is among the most popular neighbor embedding methods. The method relies on attractive and repulsive forces among high-dimensional data points to obtain a low-dimensional embedding. In this paper, we analyze the forces to reveal their effects on cluster formations and visualization and compare UMAP to its contemporaries. Repulsion emphasizes differences, controlling cluster boundaries and inter-cluster distance. Attraction is more subtle, as attractive tension between points can manifest simultaneously as attraction and repulsion in the lower-dimensional mapping. This explains the need for learning rate annealing and motivates the different treatments between attractive and repulsive terms. Moreover, by modifying attraction, we improve the consistency of cluster formation under random initialization. Overall, our analysis makes UMAP and similar embedding methods more interpretable, more robust, and more accurate.
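The observation that attraction can act as repulsion in the embedding can be illustrated with a small numerical sketch. It is not the paper's derivation: it assumes the gradient coefficients used in the umap-learn reference implementation for the low-dimensional similarity $1/(1+a\zeta^{2b})$, assumes curve parameters $a \approx 1.577$ and $b \approx 0.895$ (roughly the values fitted for `min_dist=0.1`), assumes both points of a pair are moved symmetrically, and ignores gradient clipping. The paper's $f_a$ and $f_r$ may be defined differently; all function names here are hypothetical.

```python
import numpy as np

# Assumed curve parameters (approximately the umap-learn fit for min_dist=0.1).
A, B = 1.577, 0.895


def attraction_coeff(zeta, a=A, b=B):
    """Magnitude of the attractive gradient coefficient at distance zeta > 0."""
    d2 = zeta ** 2
    return 2.0 * a * b * d2 ** (b - 1.0) / (1.0 + a * d2 ** b)


def repulsion_coeff(zeta, a=A, b=B, eps=1e-3):
    """Magnitude of the repulsive gradient coefficient (eps avoids division by zero)."""
    d2 = zeta ** 2
    return 2.0 * b / ((eps + d2) * (1.0 + a * d2 ** b))


def distance_ratio_after_attraction(zeta, lr, a=A, b=B):
    """New distance divided by old distance after one symmetric attractive step.

    With y_i <- y_i - lr*c*(y_i - y_j) and y_j <- y_j + lr*c*(y_i - y_j),
    the difference vector is scaled by (1 - 2*lr*c). Clipping is ignored.
    """
    return abs(1.0 - 2.0 * lr * attraction_coeff(zeta, a, b))


for zeta, lr in [(0.01, 1.0), (5.0, 1.0), (0.01, 0.01)]:
    ratio = distance_ratio_after_attraction(zeta, lr)
    effect = "expansion" if ratio > 1.0 else "contraction"
    print(f"zeta={zeta}, lr={lr}: ratio={ratio:.3f} ({effect})")
```

Under these assumptions, a nearby pair overshoots and ends up farther apart at a learning rate of 1, while the same pair contracts once the learning rate is small, which is consistent with the role the abstract assigns to learning rate annealing.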
