Visual Cluster Separation Using High-Dimensional Sharpened Dimensionality Reduction

Youngjoo Kim, Alexandru C. Telea, Scott C. Trager, Jos B. T. M. Roerdink

Abstract

Applying dimensionality reduction (DR) to large, high-dimensional data sets can be challenging when distinguishing the underlying high-dimensional data clusters in a 2D projection for exploratory analysis. We address this problem by first sharpening the clusters in the original high-dimensional data prior to the DR step using Local Gradient Clustering (LGC). We then project the sharpened data from the high-dimensional space to 2D using a user-selected DR method. The sharpening step helps this method preserve cluster separation in the resulting 2D projection. With our method, end-users can label each distinct cluster to further analyze an otherwise unlabeled data set. Our 'High-Dimensional Sharpened DR' (HD-SDR) method, tested on both synthetic and real-world data sets, benefits DR methods with poor cluster separation and yields better visual cluster separation than these DR methods without sharpening. Our method achieves good quality (measured by quality metrics) and scales well computationally with large high-dimensional data. To illustrate its concrete applications, we further apply HD-SDR to a recent astronomical catalog.
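
The two-stage idea in the abstract (sharpen in the high-dimensional space, then project) can be illustrated with a rough sketch. This is not the authors' LGC: the `sharpen` function below is a simple mean-shift-like stand-in that moves each point a fraction `alpha` toward the mean of its `k_s` nearest neighbors for `T` iterations, and `pca_2d` is a placeholder for the user-selected DR method. All function names and default values are assumptions made for illustration only.

```python
import numpy as np

def knn_indices(X, k):
    """Brute-force indices of the k nearest neighbors of each row (self excluded)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                     # never pick the point itself
    return np.argsort(d2, axis=1)[:, :k]

def sharpen(X, T=10, k_s=20, alpha=0.2):
    """Stand-in for the sharpening step (NOT the paper's LGC).

    Each iteration moves every point a fraction alpha (in [0, 1]) toward
    the mean of its k_s nearest neighbors, which pulls points toward
    locally dense regions.
    """
    Y = np.asarray(X, dtype=float).copy()
    for _ in range(T):
        nbrs = knn_indices(Y, k_s)
        local_mean = Y[nbrs].mean(axis=1)  # shape (n, d)
        Y += alpha * (local_mean - Y)
    return Y

def pca_2d(X):
    """Placeholder DR method: project onto the first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def sharpened_projection(X, project=pca_2d, **sharpen_kwargs):
    """Sharpen in the original space first, then apply the chosen DR method."""
    return project(sharpen(X, **sharpen_kwargs))
```

Any projection that maps an `(n, d)` array to `(n, 2)` could be passed as `project`; the brute-force neighbor search is quadratic in the number of points and is only meant for small examples.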

Paper Structure

This paper contains 32 sections, 7 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Effects of parameters used in LGC. 2D Gaussian data with 10$\mathrm{K}$ observations and three clusters (a) are used to show the effects of the number of iterations ($T$) in (a)--(d), the number of nearest neighbors ($k_s$) in (e)--(l), and the learning rate ($\alpha$) in (m)--(p). Points are color-coded based on their ground-truth labels. The cluster borders become fuzzy when $T$ is too high, as shown in (d). $k_s$ and $\alpha$ both contribute to the degree of segmentation of the clusters; unless an appropriate $\alpha$ is chosen, $k_s$ may not significantly affect the segmentation, as shown in rows (e)--(h) and (i)--(l). Note that $\alpha$ is restricted to the fixed range $[0,1]$.
  • Figure 2: Effects of different parameters using 2D non-Gaussian (log-normal, $\mu=0$, and $\sigma=1$) data with 10$\mathrm{K}$ observations. The effects of the parameters are similar to those in Figure 1. However, LGC with too large values of $T$ and $\alpha$ is prone to outliers (long tails), as shown in (d) and (l). This problem can be solved by setting a larger value of $k_s$.
  • Figure 3: Results of four neighborhood-based quality metrics for Banknote data: Neighborhood-hit ($Q_h$), Trustworthiness ($Q_t$), Continuity ($Q_c$), and Jaccard set distance ($Q_j$). Note that $Q_h$ is consistent with results from Figure \ref{fig:3fdr_all_synthetic} and best represents the visual cluster separation, whereas $Q_t$, $Q_c$, and $Q_j$ suggest the opposite. Note that $Q_t$, $Q_c$, and $Q_j$ do not consider class label information. More results including $Q_t$, $Q_c$, and $Q_j$ for the five synthetic data sets can be found in the supplemental materials.
  • Figure 4: Comparison of neighborhood-hit ($Q_h$) for sharpened data and original data of the five different types of synthetic data used in Figure \ref{fig:3fdr_all_synthetic}. For all synthetic data sets, $Q_h$ is always higher for the sharpened data as compared with the original data. We also note that $Q_h$ for sharpened data is higher when clusters are more separated ($\alpha=0.04$ compared with $\alpha=0.01$).
  • Figure 5: Comparison of neighborhood-hit ($Q_h$) for DR and HD-SDR of the five different types of synthetic data used for Figure \ref{fig:3fdr_all_synthetic}. Note that St-SNE, t-SNE, and SLMDS yield high $Q_h$-values near one for (a)--(c) and (e), which suggests that the corresponding labels of the $k$-size neighborhoods are well-preserved for HD-SDR. However, HD-SDR for sub-clustered data produces lower $Q_h$ compared with DR, as shown in (d), and this can be seen visually in Figure \ref{fig:3fdr_all_synthetic}(a)--(d). More results including $Q_t$, $Q_c$, and $Q_j$ for the five synthetic data sets can be found in the supplemental materials.
  • ...and 8 more figures
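
The neighborhood-hit metric ($Q_h$) mentioned in the captions above relies on class labels within $k$-size neighborhoods. The paper's exact formula is not shown in this excerpt, so the sketch below assumes one common definition: for each projected point, take the fraction of its $k$ nearest neighbors in the 2D projection that share its label, then average over all points. The function name `neighborhood_hit` and the default `k=7` are illustrative assumptions.

```python
import numpy as np

def neighborhood_hit(P, labels, k=7):
    """Average fraction of each point's k nearest neighbors (in the
    projection P) that carry the same label as the point itself.

    Assumed definition for illustration; values near 1 mean that
    neighborhoods in the projection are label-pure.
    """
    P = np.asarray(P, dtype=float)
    labels = np.asarray(labels)
    sq = np.sum(P ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * P @ P.T  # squared distances
    np.fill_diagonal(d2, np.inf)                     # exclude the point itself
    nbrs = np.argsort(d2, axis=1)[:, :k]             # (n, k) neighbor indices
    same = labels[nbrs] == labels[:, None]           # (n, k) boolean matches
    return float(same.mean())
```

Because this score depends on labels, it can disagree with label-free metrics such as trustworthiness or continuity, which is consistent with the Figure 3 caption noting that those metrics do not consider class label information.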