Cluster Exploration using Informative Manifold Projections

Stavros Gerolymatos, Xenophon Evangelopoulos, Vladimir Gusev, John Y. Goulermas

TL;DR

IMAPCE introduces an informative-manifold-projection framework for cluster exploration that jointly removes prior-knowledge structure and enforces meaningful separation through a bi-objective loss on the Stiefel manifold. The method minimizes $f(\mathbf{V}) = \|\mathbf{X}-\mathbf{X}\mathbf{V}\mathbf{V}^T\|_F^2 - \alpha\|\mathbf{Y}-\mathbf{Y}\mathbf{V}\mathbf{V}^T\|_F^2 + \mu n \sum_{i=1}^{n}[\mathbf{x}_i^T\mathbf{V}(\mathbf{V}^T\mathbf{X}^T\mathbf{X}\mathbf{V})^{-1}\mathbf{V}^T\mathbf{x}_i]^2$ subject to $\mathbf{V}^T\mathbf{V}=\mathbf{I}$, and iteratively updates the prior data via a Dirichlet Process Gaussian Mixture Model to reveal progressively new structure. The approach is validated on synthetic data, UCI Adult data, and complex image datasets (MNIST/FMNIST and CIFAR-100/FMNIST), outperforming ct-SNE, cPCA, and Fair-NeRV in both prior removal and cluster separation while remaining computationally efficient. The framework enables automated, interactive exploration of high-dimensional data by incorporating user-derived prior knowledge and guiding the discovery of latent patterns. Overall, IMAPCE advances scalable, informative visual analytics for high-dimensional clustering tasks.

Abstract

Dimensionality reduction (DR) is one of the key tools for the visual exploration of high-dimensional data and uncovering its cluster structure in two- or three-dimensional spaces. The vast majority of DR methods in the literature do not take into account any prior knowledge a practitioner may have regarding the dataset under consideration. We propose a novel method to generate informative embeddings which not only factor out the structure associated with different kinds of prior knowledge but also aim to reveal any remaining underlying structure. To achieve this, we employ a linear combination of two objectives: firstly, contrastive PCA that discounts the structure associated with the prior information, and secondly, kurtosis projection pursuit which ensures meaningful data separation in the obtained embeddings. We formulate this task as a manifold optimization problem and validate it empirically across a variety of datasets considering three distinct types of prior knowledge. Lastly, we provide an automated framework to perform iterative visual exploration of high-dimensional data.
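
To make the combined objective concrete, the sketch below evaluates $f(\mathbf{V})$ exactly as written in the TL;DR for a single point on the Stiefel manifold. It is an illustration under stated assumptions, not the authors' implementation: the function names `random_stiefel` and `imapce_objective` are hypothetical, the values of `alpha` and `mu` are arbitrary, $\mathbf{X}$ is taken to hold the data and $\mathbf{Y}$ the prior (background) data in the contrastive PCA sense, and both are assumed to be column-centered. No optimization is performed here; the paper treats that step as a manifold optimization problem.

```python
import numpy as np


def random_stiefel(p, d, rng):
    """Return a p x d matrix with orthonormal columns (hypothetical helper)."""
    q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return q


def imapce_objective(V, X, Y, alpha=1.0, mu=1.0):
    """Evaluate f(V) from the TL;DR; alpha and mu defaults are arbitrary."""
    P = V @ V.T  # projector onto the span of V
    # Contrastive PCA part: reconstruction error on X minus alpha times that on Y.
    recon_x = np.linalg.norm(X - X @ P, "fro") ** 2
    recon_y = np.linalg.norm(Y - Y @ P, "fro") ** 2
    # Kurtosis part: squared per-sample quadratic forms of the projected data.
    Z = X @ V  # n x d projections
    S_inv = np.linalg.inv(Z.T @ Z)  # (V^T X^T X V)^{-1}
    m = np.einsum("ij,jk,ik->i", Z, S_inv, Z)  # x_i^T V S_inv V^T x_i for each i
    kurtosis = mu * X.shape[0] * np.sum(m ** 2)
    return recon_x - alpha * recon_y + kurtosis


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
X -= X.mean(axis=0)  # centering is an assumption of this sketch
Y = rng.standard_normal((100, 6))
Y -= Y.mean(axis=0)
V = random_stiefel(6, 2, rng)
value = imapce_objective(V, X, Y, alpha=0.5, mu=0.1)
```

Because every term depends on $\mathbf{V}$ only through the subspace it spans, the value is unchanged when $\mathbf{V}$ is replaced by $\mathbf{V}\mathbf{Q}$ for any orthogonal $d \times d$ matrix $\mathbf{Q}$, which gives a convenient sanity check for an implementation.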

Paper Structure

This paper contains 21 sections, 7 equations, 6 figures, 11 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Top row shows synthetic data experiments with the information in dimensions one to four as a prior. Bottom row shows UCI Adult data experiments with the ethnicity feature as a prior. (a) cPCA embeddings are clustered w.r.t. labels from dimensions one to four. (b) ct-SNE embeddings are clustered w.r.t. labels from dimensions five to six (complementary structure) with some noticeable error (overlap). (c,d) Fair-NeRV and IMAPCE embeddings are clustered w.r.t. labels from dimensions five to six (complementary structure) with clearer separation. (e) cPCA fails to separate embeddings according to their gender and income. (f) ct-SNE clearly separates embeddings w.r.t. their income and, to some extent, their gender. (g) Fair-NeRV computes embeddings that are mostly separated w.r.t. gender and income, with the exception of some outliers, while (h) IMAPCE perfectly clusters embeddings according to their gender and income (revealing complementary structure).
  • Figure 2: Top row shows complex MNIST + FMNIST experiments using MNIST as prior (a), while the bottom row shows complex CIFAR-100 + FMNIST experiments using CIFAR-100 as prior (f). In both cases, embeddings by cPCA (b,g), ct-SNE (c,h) and Fair-NeRV (d,i) appear significantly more mixed w.r.t. their class than the IMAPCE ones (e,j), which exhibit better segregation.
  • Figure 3: Iterative exploration of UCI image segmentation data by IMAPCE. Each column (subfigure) corresponds to an iteration of the exploration process. In every column (subfigure), the upper plot shows the data projections, where prior data are colored grey and the unexplored subset black. The middle plot shows the clustering of the unexplored points according to DPGMM, with the most distinct cluster encircled. The bottom plot illustrates the unexplored points colored according to their ground truth label.
  • Figure 4: Visualisations of UCI Adult data projections computed by cPCA, ct-SNE, Fair-NeRV and IMAPCE. (a), (b), (c), (d) were computed using the gender attribute as prior. (e), (f), (g), (h) were computed using the combination of gender and ethnicity attributes as prior.
  • Figure 5: Superimposed MNIST + FMNIST embeddings computed by all methods for different combinations of FMNIST classes. (a), (b), (c), (d) for "Sandal"-"Ankle boot" case, (e), (f), (g), (h) for "Tshirt"-"Dress". (i), (l) MNIST instances, (j), (m) FMNIST instances, (k), (n) their superimposition results.
  • ...and 1 more figure