UMATO: Bridging Local and Global Structures for Reliable Visual Analytics with Dimensionality Reduction

Hyeon Jeon, Kwon Ko, Soohyun Lee, Jake Hyun, Taehyun Yang, Gyehun Go, Jaemin Jo, Jinwook Seo

TL;DR

UMATO tackles the reliability problem in dimensionality reduction by introducing a two-phase optimization that first builds a global skeletal projection using hub points and then embeds the remaining data to preserve local structure. By splitting the optimization, UMATO achieves state-of-the-art global-structure preservation while maintaining competitive local fidelity, and it demonstrates superior scalability and stability against subsampling and initialization compared with existing methods like UMAP, PaCMAP, and TriMap. The approach leverages a kNN-based hub classification, PCA-based hub initialization, and a targeted loss to balance global and local objectives, with disconnected points (DCPs) arranged near the centroids of their nearest neighbors (NNs) to minimize distortion. Extensive quantitative and qualitative evaluations on real-world and synthetic datasets show UMATO’s potential to enhance reliable visual analytics in high-dimensional data, complemented by open-source software and practical hyperparameter guidance.
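
To make the kNN-based classification more concrete, here is a minimal Python sketch of how points could be split into hubs, extended nearest neighbors (eNNs), and DCPs from kNN indices alone. The selection rule is an assumption made for illustration, since this summary does not spell it out: hubs are taken to be the points that appear most often in other points' kNN lists, eNNs are the points reachable from a hub by following kNN edges, and DCPs are whatever remains. The names `knn_indices`, `classify_points`, and the integer codes are hypothetical and are not taken from the UMATO software.

```python
import numpy as np

HUB, ENN, DCP = 0, 1, 2  # hypothetical integer codes for the three groups


def knn_indices(X, k):
    """Brute-force k nearest neighbors (self excluded); only for small n."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def classify_points(knn, n_hubs):
    """Label every point as HUB, ENN, or DCP using only kNN indices.

    Assumed rule (not confirmed by the source): the n_hubs points with the
    highest in-degree in the kNN graph become hubs.
    """
    n = knn.shape[0]
    in_degree = np.bincount(knn.ravel(), minlength=n)
    hubs = np.argsort(-in_degree, kind="stable")[:n_hubs]

    labels = np.full(n, DCP)  # everything starts as disconnected
    labels[hubs] = HUB

    # Breadth-first expansion along kNN edges, starting from the hubs.
    frontier = [int(h) for h in hubs]
    while frontier:
        reached = []
        for i in frontier:
            for j in knn[i]:
                if labels[j] == DCP:
                    labels[j] = ENN
                    reached.append(int(j))
        frontier = reached
    return labels
```

With two well-separated clusters and a single hub, every point in the cluster without the hub stays a DCP, because no kNN edge crosses the gap between the clusters.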

Abstract

Due to the intrinsic complexity of high-dimensional (HD) data, dimensionality reduction (DR) techniques cannot preserve all the structural characteristics of the original data. Therefore, DR techniques focus on preserving either local neighborhood structures (local techniques) or global structures such as pairwise distances between points (global techniques). However, both approaches can mislead analysts to erroneous conclusions about the overall arrangement of manifolds in HD data. For example, local techniques may exaggerate the compactness of individual manifolds, while global techniques may fail to separate clusters that are well-separated in the original space. In this research, we provide a deeper insight into Uniform Manifold Approximation with Two-phase Optimization (UMATO), a DR technique that addresses this problem by effectively capturing local and global structures. UMATO achieves this by dividing the optimization process of UMAP into two phases. In the first phase, it constructs a skeletal layout using representative points, and in the second phase, it projects the remaining points while preserving the regional characteristics. Quantitative experiments validate that UMATO outperforms widely used DR techniques, including UMAP, in terms of global structure preservation, with a slight loss in local structure. We also confirm that UMATO outperforms baseline techniques in terms of scalability and stability against initialization and subsampling, making it more effective for reliable HD data analysis. Finally, we present a case study and a qualitative demonstration that highlight UMATO's effectiveness in generating faithful projections, enhancing the overall reliability of visual analytics using DR.
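
The ordering of the two phases can be illustrated with a short initialization-only sketch in Python. It is not the authors' implementation: the gradient-based optimization that UMATO runs in each phase is omitted, the hub indices are assumed to be given, and placing each remaining point at the centroid of its nearest hubs (plus a small jitter) is an assumption made for illustration. The names `pca_project`, `two_phase_layout`, `n_anchor`, and `jitter` are hypothetical.

```python
import numpy as np


def pca_project(X, dim=2):
    """Project X onto its first `dim` principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T


def two_phase_layout(X, hub_idx, n_anchor=3, jitter=1e-3, seed=0):
    """Initialization-only sketch: hubs first, then everything else."""
    X = np.asarray(X, dtype=float)
    hub_idx = np.asarray(hub_idx)
    rng = np.random.default_rng(seed)
    Y = np.zeros((X.shape[0], 2))

    # Phase 1 (skeleton): lay out the representative points with PCA.
    Y[hub_idx] = pca_project(X[hub_idx], dim=2)

    # Phase 2 (remaining points): place each point at the centroid of the
    # 2-D positions of its n_anchor nearest hubs in the original space.
    rest = np.setdiff1d(np.arange(X.shape[0]), hub_idx)
    d2 = ((X[rest][:, None, :] - X[hub_idx][None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :n_anchor]  # positions within hub_idx
    centroids = Y[hub_idx][nearest].mean(axis=1)
    Y[rest] = centroids + rng.normal(0.0, jitter, size=centroids.shape)
    return Y
```

Because the hub positions are fixed before any other point is placed, the skeleton does not depend on how the remaining points are later arranged, which is the intuition behind splitting the optimization.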

Paper Structure

This paper contains 32 sections, 10 equations, 14 figures, 3 tables, 2 algorithms.

Figures (14)

  • Figure 1: The comparison between the pipelines of UMAP and UMATO. Given HD data, UMATO first constructs a $k$NN graph and classifies points into three groups (hubs, extended nearest neighbors or eNNs, and disconnected points or DCPs) using the $k$NN indices (a). In the layout optimization stage, hubs, eNNs, and DCPs are embedded separately in order (b-d). Note that UMATO also starts by initializing hubs, but we omit this in the figure. The separation of optimization enhances UMAP's stability and accuracy in preserving global structure. In contrast, UMAP does not classify points and optimizes every point together, compromising its stability and precision in maintaining the global structure (e-h).
  • Figure 2: DR techniques ranked by local (a) and global (b) quality metrics in accuracy analysis (\ref{sec:accuexp}, \ref{tab:accuracy}). Among the ten techniques we compared, UMATO demonstrated the highest accuracy in terms of global metrics and showed intermediate performance for local metrics. The error bars depict 95% confidence intervals. Please refer to \ref{tab:accuracy} for the detailed statistics.
  • Figure 3: The subset of the projections generated in our accuracy analysis (\ref{sec:accuexp}). Colors depict the class label of each dataset. The analysis results verified that UMATO outperforms competitors in terms of accurately preserving global structure while maintaining competitive performance in depicting local structure. Note that we only depict the projections made by default configurations for UMATO and UMAP.
  • Figure 4: The results of the scalability analysis with small datasets (\ref{sec:scalsmall}). Note that LAMP and MDS have been removed from the figure as they need substantially longer computation time, making the runtime of all other techniques look similar. UMATO takes about three seconds on average to generate projections, outperforming all other nonlinear DR techniques. The error bars depict confidence intervals (95%).
  • Figure 5: The results of the scalability analysis with large datasets (\ref{sec:scalexp}). Overall, UMATO is on par with UMAP and outperforms every competitor except PCA. The regression line is fitted to the $y=a\cdot x\log x + b$ function, following the time complexity of UMATO, UMAP, and its variants (\ref{sec:complexity}). LLE implementation is not depicted here as it requires more than 5,000 seconds to compute the smallest sampled subset of the data.
  • ...and 9 more figures
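
The regression mentioned in the Figure 5 caption, $y=a\cdot x\log x + b$, is linear in the coefficients $a$ and $b$, so it can be fitted with ordinary least squares. The Python sketch below shows one way to do that with NumPy. The function name `fit_xlogx` is hypothetical, and the sizes and runtimes in the usage lines are made-up placeholders rather than measurements from the paper.

```python
import numpy as np


def fit_xlogx(sizes, runtimes):
    """Least-squares fit of runtime = a * x * log(x) + b.

    Returns the coefficients (a, b) as floats.
    """
    x = np.asarray(sizes, dtype=float)
    y = np.asarray(runtimes, dtype=float)
    # Design matrix: one column for x*log(x), one constant column for b.
    A = np.column_stack([x * np.log(x), np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(a), float(b)


# Usage with placeholder numbers (NOT results from the paper).
sizes = [1_000, 2_000, 5_000, 10_000, 20_000]
runtimes = [1.7, 1.9, 2.6, 3.8, 6.1]
a, b = fit_xlogx(sizes, runtimes)
predicted_at_50k = a * 50_000 * np.log(50_000) + b
```

Fitting this functional form rather than a straight line matches the stated time complexity, so the curve extrapolates more sensibly to larger dataset sizes.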