InfoClus: Informative Clustering of High-dimensional Data Embeddings
Fuyin Lai, Edith Heiter, Guillaume Bied, Jefrey Lijffijt
TL;DR
The paper tackles the interpretability challenge of dimensionality-reduction embeddings by automatically partitioning the embedding into cohesive clusters, each with a sparse explanation in terms of the original high-dimensional attributes. It introduces InfoClus, which jointly optimizes the partitioning and the explanations through an explanation ratio $R_{\boldsymbol{\beta}}$ that trades information content against a complexity penalty. Information content is measured by the KL divergence $D_{KL}$, using Gaussian models for real-valued attributes and categorical models for discrete ones. The search is constrained to partitions compatible with a hierarchical clustering and is performed via a greedy, scalable procedure. Experiments on the Cytometry, GSE, and Mushroom data sets show that InfoClus yields interpretable, expert-aligned clusters and provides a practical starting point for the analysis of DR-based scatter plots. The work also demonstrates scalability, analyzes hyper-parameter effects, compares against RVX and VERA, and discusses limitations and future directions, with code to be made openly available.
Abstract
Developing an understanding of high-dimensional data can be facilitated by visualizing that data using dimensionality reduction. However, the low-dimensional embeddings are often difficult to interpret. To facilitate the exploration and interpretation of low-dimensional embeddings, we introduce a new concept named partitioning with explanations. The idea is to partition the data shown through the embedding into groups, each of which is given a sparse explanation using the original high-dimensional attributes. We introduce an objective function that quantifies how much we can learn through observing the explanations of the data partitioning, using information theory, and also how complex the explanations are. Through parameterization of the complexity, we can tune the solutions towards the desired granularity. We propose InfoClus, which optimizes the partitioning and explanations jointly, through greedy search constrained over a hierarchical clustering. We conduct a qualitative and quantitative analysis of InfoClus on three data sets. We contrast the results on the Cytometry data with published manual analysis results, and compare with two other recent methods for explaining embeddings (RVX and VERA). These comparisons highlight that InfoClus has distinct advantages over existing procedures and methods. We find that InfoClus can automatically create good starting points for the analysis of dimensionality-reduction-based scatter plots.
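As a rough illustration of the information-theoretic ingredient mentioned in the TL;DR, the sketch below computes the standard closed-form KL divergence between two univariate Gaussians and between two categorical distributions. It only shows how strongly one attribute's distribution within a cluster can differ from its distribution over the whole data set. The function names `kl_gaussian` and `kl_categorical` are hypothetical, and the way InfoClus combines such terms with the complexity penalty into $R_{\boldsymbol{\beta}}$ is not specified here and is not reproduced.

```python
import math


def kl_gaussian(mu_c, sigma_c, mu_all, sigma_all):
    """KL( N(mu_c, sigma_c^2) || N(mu_all, sigma_all^2) ) in nats.

    Illustrative reading: (mu_c, sigma_c) describe a real-valued attribute
    inside one cluster, (mu_all, sigma_all) the same attribute over all data.
    """
    return (
        math.log(sigma_all / sigma_c)
        + (sigma_c ** 2 + (mu_c - mu_all) ** 2) / (2 * sigma_all ** 2)
        - 0.5
    )


def kl_categorical(p_c, p_all):
    """KL( p_c || p_all ) in nats for two categorical distributions.

    Both arguments are sequences of probabilities over the same categories.
    Terms with p = 0 contribute nothing (the usual 0 * log 0 = 0 convention);
    p_all is assumed to be positive wherever p_c is positive.
    """
    return sum(p * math.log(p / q) for p, q in zip(p_c, p_all) if p > 0)


if __name__ == "__main__":
    # A cluster whose mean is shifted by one standard deviation.
    print(kl_gaussian(1.0, 1.0, 0.0, 1.0))
    # A cluster that contains a single category of a balanced binary attribute.
    print(kl_categorical([1.0, 0.0], [0.5, 0.5]))
```

Under this reading, an attribute with a large divergence for a given cluster is a natural candidate for that cluster's sparse explanation, whereas an attribute distributed like the full data set carries no information.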
