InfoClus: Informative Clustering of High-dimensional Data Embeddings

Fuyin Lai, Edith Heiter, Guillaume Bied, Jefrey Lijffijt

TL;DR

The paper tackles the interpretability challenge of dimensionality‑reduction (DR) embeddings by automatically partitioning the embedding into cohesive clusters, each given a sparse explanation in terms of the original high‑dimensional attributes. It introduces InfoClus, which jointly optimizes the partitioning and the explanations through an explanation ratio $R_{\boldsymbol{\beta}}$ that trades information content against a complexity penalty. Information content is modeled via the Kullback-Leibler divergence $D_{KL}$, using Gaussian distributions for real-valued attributes and categorical distributions for discrete ones. The search is constrained to partitions compatible with a hierarchical clustering and is performed via a greedy, scalable procedure. Experiments on the Cytometry, GSE, and Mushroom data sets show that InfoClus yields interpretable, expert‑aligned clusters and provides a practical starting point for DR‑based scatter plot analysis. The work demonstrates scalability, analyzes hyper‑parameter effects, compares against RVX and VERA, and discusses limitations and future directions, with code to be openly available.
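To make the information term concrete, the following is a minimal sketch of the two standard closed-form KL divergences the summary mentions: one between univariate Gaussians (for a real-valued attribute) and one between categorical distributions (for a discrete attribute). The function names, the direction of the divergence (cluster distribution relative to the full-data distribution), and the use of nats are assumptions made for illustration; the summary does not specify how InfoClus weights or combines these terms.

```python
import math


def kl_gaussian(mu_c, sigma_c, mu_0, sigma_0):
    """KL( N(mu_c, sigma_c^2) || N(mu_0, sigma_0^2) ) in nats.

    Assumed reading: (mu_c, sigma_c) describe an attribute inside a cluster,
    (mu_0, sigma_0) describe the same attribute on the full data set.
    """
    return (
        math.log(sigma_0 / sigma_c)
        + (sigma_c ** 2 + (mu_c - mu_0) ** 2) / (2 * sigma_0 ** 2)
        - 0.5
    )


def kl_categorical(p, q):
    """KL(p || q) in nats for two distributions over the same categories.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever
    p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical example: a cluster whose attribute mean is shifted by one
# standard deviation, and a cluster that contains a single category only.
shifted = kl_gaussian(1.0, 1.0, 0.0, 1.0)
pure = kl_categorical([1.0, 0.0], [0.5, 0.5])
```

An attribute whose within-cluster distribution matches the full data has divergence zero, so under this reading it would carry no information as an explanation.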

Abstract

Developing an understanding of high-dimensional data can be facilitated by visualizing that data using dimensionality reduction. However, the low-dimensional embeddings are often difficult to interpret. To facilitate the exploration and interpretation of low-dimensional embeddings, we introduce a new concept named partitioning with explanations. The idea is to partition the data shown through the embedding into groups, each of which is given a sparse explanation using the original high-dimensional attributes. We introduce an objective function that quantifies how much we can learn through observing the explanations of the data partitioning, using information theory, and also how complex the explanations are. Through parameterization of the complexity, we can tune the solutions towards the desired granularity. We propose InfoClus, which optimizes the partitioning and explanations jointly, through greedy search constrained over a hierarchical clustering. We conduct a qualitative and quantitative analysis of InfoClus on three data sets. We contrast the results on the Cytometry data with published manual analysis results, and compare with two other recent methods for explaining embeddings (RVX and VERA). These comparisons highlight that InfoClus has distinct advantages over existing procedures and methods. We find that InfoClus can automatically create good starting points for the analysis of dimensionality-reduction-based scatter plots.
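The idea of constraining the search to a hierarchical clustering can be illustrated with a small sketch: an agglomerative merge sequence yields a chain of nested partitions, and only those partitions are considered as candidates. The code below is a toy, pure-Python version that assumes single linkage on 2-D embedding coordinates; the linkage choice, the function name `candidate_partitions`, and the example points are illustrative assumptions, not the authors' implementation, and the scoring of candidates by the objective is not shown.

```python
from itertools import combinations


def candidate_partitions(points):
    """Return the nested partitions produced by single-linkage agglomeration.

    points: list of (x, y) tuples (e.g. embedding coordinates).
    Each partition is a list of clusters; each cluster is a list of indices.
    """
    def dist2(i, j):
        (x1, y1), (x2, y2) = points[i], points[j]
        return (x1 - x2) ** 2 + (y1 - y2) ** 2

    def linkage(a, b):
        # Single linkage: smallest squared distance between members.
        return min(dist2(i, j) for i in a for j in b)

    clusters = [[i] for i in range(len(points))]
    partitions = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        partitions.append([c[:] for c in clusters])
    return partitions


# Hypothetical toy embedding with two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (10.0, 5.0)]
parts = candidate_partitions(points)
```

For n points this produces n candidate partitions, from all singletons to one cluster, which is far fewer than the number of arbitrary partitions and is what makes a greedy search over them tractable.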

Paper Structure

This paper contains 16 sections, 4 equations, 11 figures, 1 table, 1 algorithm.

Figures (11)

  • Figure 1: $\texttt{InfoClus}$ workflow
  • Figure 2: Example dataset.
  • Figure 3: How hierarchical clustering works as a candidate partitioning generator.
  • Figure 4: $\texttt{Cytometry}$ dataset. Panels a1 and a2 show t-SNE embeddings colored by manual gating [saeys2016] and by $\texttt{InfoClus}$, respectively. Panel b1 shows how manual gating proceeds to label cells in a2. Panel b2 displays the explanations selected by $\texttt{InfoClus}$ for each cluster (the color coding maps explanations to their respective clusters). Each plot displays Kernel Density Estimates (KDE) of the distributions of the attribute in the cluster (dotted line) and on the full dataset (solid line). Filled color represents a KDE of the cluster scaled by how many points are covered by the cluster in the full data.
  • Figure 5: Results of other methods aimed at explaining embeddings on $\texttt{Cytometry}$
  • ...and 6 more figures

Theorems & Definitions (4)

  • Definition
  • Definition
  • Definition
  • Definition