Cluster Exploration using Informative Manifold Projections
Stavros Gerolymatos, Xenophon Evangelopoulos, Vladimir Gusev, John Y. Goulermas
TL;DR
IMAPCE introduces an informative manifold projection framework for cluster exploration that jointly removes structure explained by prior knowledge and enforces meaningful separation through a bi-objective loss on the Stiefel manifold. The method minimizes $f(\mathbf{V}) = \|\mathbf{X}-\mathbf{X}\mathbf{V}\mathbf{V}^T\|_F^2 - \alpha\|\mathbf{Y}-\mathbf{Y}\mathbf{V}\mathbf{V}^T\|_F^2 + \mu n \sum_{i=1}^{n}[\mathbf{x}_i^T\mathbf{V}(\mathbf{V}^T\mathbf{X}^T\mathbf{X}\mathbf{V})^{-1}\mathbf{V}^T\mathbf{x}_i]^2$ subject to $\mathbf{V}^T\mathbf{V}=\mathbf{I}$, and iteratively updates the prior data via a Dirichlet Process Gaussian Mixture Model to reveal progressively new structure. The approach is validated on synthetic data, the UCI Adult dataset, and complex image datasets (MNIST/FMNIST and CIFAR-100/FMNIST), where it outperforms ct-SNE, cPCA, and Fair-NeRV in both prior removal and cluster separation while remaining computationally efficient. The framework supports automated, interactive exploration of high-dimensional data by incorporating user-derived prior knowledge and guiding the discovery of latent patterns.
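To make the loss above concrete, the following is a minimal NumPy sketch that evaluates $f(\mathbf{V})$ for a given orthonormal $\mathbf{V}$. It is an illustration under assumptions, not the authors' implementation: the function names (`projected_leverages`, `imapce_objective`, `random_stiefel`) and the default values of `alpha` and `mu` are hypothetical, rows of `X` and `Y` are taken to be data points, and `X` is assumed to be centered.

```python
import numpy as np

def projected_leverages(X, V):
    """Return x_i^T V (V^T X^T X V)^{-1} V^T x_i for every row x_i of X."""
    Z = X @ V                      # projected data, shape (n, d)
    S = Z.T @ Z                    # d x d scatter of the projection
    # S is symmetric, so solve(S, Z.T).T equals Z @ inv(S) without forming inv(S).
    return np.sum(Z * np.linalg.solve(S, Z.T).T, axis=1)

def imapce_objective(V, X, Y, alpha=1.0, mu=1.0):
    """Evaluate the three-term loss from the TL;DR (alpha, mu defaults are assumptions)."""
    n = X.shape[0]
    P = V @ V.T                    # orthogonal projector onto span(V)
    recon_X = np.linalg.norm(X - X @ P, "fro") ** 2
    recon_Y = np.linalg.norm(Y - Y @ P, "fro") ** 2
    kurtosis_term = n * np.sum(projected_leverages(X, V) ** 2)
    return recon_X - alpha * recon_Y + mu * kurtosis_term

def random_stiefel(p, d, rng):
    """Sample a p x d matrix with orthonormal columns via a reduced QR factorization."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return Q
```

Two properties follow directly from the formula and are useful sanity checks: the per-point quantities returned by `projected_leverages` sum to the embedding dimension $d$, and the loss is unchanged when $\mathbf{V}$ is replaced by $\mathbf{V}\mathbf{R}$ for any orthogonal $d \times d$ matrix $\mathbf{R}$, so the objective depends only on the subspace spanned by $\mathbf{V}$.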
Abstract
Dimensionality reduction (DR) is one of the key tools for the visual exploration of high-dimensional data and uncovering its cluster structure in two- or three-dimensional spaces. The vast majority of DR methods in the literature do not take into account any prior knowledge a practitioner may have regarding the dataset under consideration. We propose a novel method to generate informative embeddings which not only factor out the structure associated with different kinds of prior knowledge but also aim to reveal any remaining underlying structure. To achieve this, we employ a linear combination of two objectives: firstly, contrastive PCA that discounts the structure associated with the prior information, and secondly, kurtosis projection pursuit which ensures meaningful data separation in the obtained embeddings. We formulate this task as a manifold optimization problem and validate it empirically across a variety of datasets considering three distinct types of prior knowledge. Lastly, we provide an automated framework to perform iterative visual exploration of high-dimensional data.
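The contrastive PCA component mentioned above can be illustrated in isolation. Since $\|\mathbf{X}-\mathbf{X}\mathbf{V}\mathbf{V}^T\|_F^2 = \mathrm{tr}(\mathbf{X}^T\mathbf{X}) - \mathrm{tr}(\mathbf{V}^T\mathbf{X}^T\mathbf{X}\mathbf{V})$ for orthonormal $\mathbf{V}$, minimizing $\|\mathbf{X}-\mathbf{X}\mathbf{V}\mathbf{V}^T\|_F^2 - \alpha\|\mathbf{Y}-\mathbf{Y}\mathbf{V}\mathbf{V}^T\|_F^2$ alone amounts to taking the top eigenvectors of $\mathbf{X}^T\mathbf{X} - \alpha\mathbf{Y}^T\mathbf{Y}$. The sketch below shows only this special case, assuming $\mathbf{X}$ holds the data and $\mathbf{Y}$ the data encoding the prior knowledge, with points as rows; the names `cpca_loss` and `cpca_directions` are hypothetical. The full method adds the kurtosis term, which has no such closed form and is the reason for the manifold optimization.

```python
import numpy as np

def cpca_loss(V, X, Y, alpha):
    """Contrastive reconstruction loss: error on X minus alpha times error on Y."""
    P = V @ V.T
    return (np.linalg.norm(X - X @ P, "fro") ** 2
            - alpha * np.linalg.norm(Y - Y @ P, "fro") ** 2)

def cpca_directions(X, Y, alpha, d):
    """Top-d eigenvectors of X^T X - alpha * Y^T Y (minimizer of cpca_loss)."""
    C = X.T @ X - alpha * (Y.T @ Y)        # symmetric contrast matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :d]         # keep the d largest
```

Larger values of `alpha` penalize directions along which the prior data varies strongly, so the returned subspace shifts toward structure that the prior does not already explain.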
