ClusterGraph: a new tool for visualization and compression of multidimensional data

Paweł Dłotko, Davide Gurnari, Mathis Hallier, Anna Jurek-Loughrey

TL;DR

ClusterGraph is an additional layer on top of the output of any clustering algorithm. It describes the global layout of the clusters that the algorithm produces, and it can be visualized and used alongside state-of-the-art exploratory data analysis techniques.

Abstract

Understanding the global organization of complicated, high-dimensional data is of primary interest in many branches of the applied sciences. It is typically achieved by applying dimensionality reduction techniques that map the data into a lower-dimensional space. This family of methods, while preserving local structures and features, often misses the global structure of the dataset. Clustering techniques are another class of methods, operating on the data in the ambient space. They group together points that are similar according to a fixed similarity criterion; however, unlike dimensionality reduction techniques, they do not provide information about the global organization of the data. Leveraging ideas from Topological Data Analysis, in this paper we provide an additional layer on the output of any clustering algorithm. This data structure, ClusterGraph, provides information about the global layout of the clusters obtained from the chosen clustering algorithm. Appropriate measures are provided to assess the quality and usefulness of the resulting representation. The ClusterGraph, possibly after an appropriate structure-preserving simplification, can then be visualized and used in synergy with state-of-the-art exploratory data analysis techniques.
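To make the idea of a layer on top of a clustering concrete, here is a minimal sketch, not the authors' implementation. It assumes Euclidean data, takes the average distance between the points of two clusters as the edge weight (one plausible choice; the abstract does not fix the inter-cluster distance), and prunes edges by a length threshold. The function names `cluster_graph` and `prune` are hypothetical.

```python
import numpy as np

def cluster_graph(points, labels):
    """Weighted complete graph on clusters, returned as {(a, b): weight}, a < b.

    Assumption: the weight is the mean Euclidean distance between the
    points of cluster a and the points of cluster b.
    """
    ids = np.unique(labels)
    edges = {}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            pa, pb = points[labels == a], points[labels == b]
            # All pairwise differences between the two clusters.
            diff = pa[:, None, :] - pb[None, :, :]
            edges[(int(a), int(b))] = float(np.linalg.norm(diff, axis=-1).mean())
    return edges

def prune(edges, max_length):
    """Keep only the edges whose weight does not exceed max_length."""
    return {e: w for e, w in edges.items() if w <= max_length}

# Toy data: three tight clusters at the corners of a right triangle.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])
labels = np.repeat(np.arange(3), 20)  # stand-in for any clustering output

graph = cluster_graph(points, labels)
pruned = prune(graph, max_length=12.0)
for (a, b), w in sorted(pruned.items()):
    print(f"cluster {a} -- cluster {b}: {w:.2f}")
```

Because distances are stored as edge labels rather than encoded in coordinates, the graph can represent cluster layouts that no planar embedding reproduces exactly. The labels can come from any clustering algorithm.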

Paper Structure

This paper contains 18 sections, 3 theorems, 14 equations, and 8 figures.

Key Result

Proposition 1

Let $X$ be a dataset and $\mathcal{C}(X)$, $\mathcal{D}(X)$ be two partitions of it such that the diameter of each set in $\mathcal{C}(X)$ and $\mathcal{D}(X)$ is at most $\delta$. Then, for any cluster $C_i \in \mathcal{C}(X)$, its image in $\mathcal{D}(X)$ has diameter at most $3\delta$.
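The bound $3\delta$ follows from the triangle inequality. The sketch below assumes that the image of $C_i$ in $\mathcal{D}(X)$ is the union of the sets $D_j \in \mathcal{D}(X)$ that intersect $C_i$; Definition 1 is not reproduced here, so this reading is an assumption, and the argument is not necessarily the authors' proof. Take $x \in D_j$ and $y \in D_k$ in the image, and pick $a \in C_i \cap D_j$ and $b \in C_i \cap D_k$. Then:

```latex
\begin{align*}
d(x, y) &\le d(x, a) + d(a, b) + d(b, y) \\
        &\le \operatorname{diam}(D_j) + \operatorname{diam}(C_i) + \operatorname{diam}(D_k) \\
        &\le 3\delta .
\end{align*}
```

Each of the three terms is a distance between two points of a single set of diameter at most $\delta$, which gives the stated bound.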

Figures (8)

  • Figure 1: The dataset consisting of four clusters $0$ (blue), $1$ (orange), $2$ (green) and $3$ (purple), as described in the text, so that elements of cluster $0$ are at distance one from elements of the remaining clusters and the mutual distances between elements of clusters $1, 2, 3$ are two. Such a dataset cannot be embedded, with the distances preserved, into any Euclidean space. In this case, UMAP (panel b) fails to capture the global layout, while t-SNE (panel c) and PHATE (panel d) do. However, the coordinate systems of t-SNE and PHATE are drastically different. In both cases, as a result of the embedding into the Euclidean plane, the ratio of the distances $d(1,2) / d(0,1)$ is roughly $\sqrt{3}$ instead of the original $2$; the same is true for the other clusters. This is the optimal embedding that can be achieved when points are projected to Euclidean space. However, in the case of ClusterGraph (panel a), the distances are encoded as labels on the graph edges and therefore we are not restricted by any Euclidean coordinate system.
  • Figure 2: ClusterGraph pipeline with the two possible pruning strategies. Details on the example dataset that has been used to generate the figures can be found in Section \ref{subsec:cc}.
  • Figure 3: Two examples of a disconnected $k$-nn graph. Both panels contain $100$ points sampled from the unit square. In panel (a) the points are sampled from the uniform distribution and $k=2$. In panel (b) the points are sampled from two normal distributions centered at $(0.25, 0.25)$ and $(0.75, 0.75)$ with variance $0.1$ and $k=20$.
  • Figure 4: The two pruned ClusterGraphs obtained by removing all edges longer than $39$ (a) and $35$ (b). Each vertex in the ClusterGraph is depicted as a pie chart showing the percentage of points of each class in the corresponding cluster.
  • Figure 5: ClusterGraph built on the output of $k$-means for 500 points sampled from two concentric circles. The $20$ clusters are depicted in (a), on top of the $10$-nn graph. The two metric distortion-pruned components are depicted in (b) and subsequently merged and connectivity pruned (c). The vertices' colors are inherited from panel (a).
  • ...and 3 more figures
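
The two numerical claims in the Figure 1 caption can be checked on a reduced example with one representative point per cluster. The sketch below is an illustration under that simplification, not code from the paper. It uses the classical multidimensional scaling criterion: a finite metric space embeds isometrically into some Euclidean space exactly when the double-centered matrix of squared distances is positive semidefinite. It then measures the distance ratio in the symmetric planar layout, with cluster $0$ at the origin and the others on the unit circle.

```python
import numpy as np

# One representative per cluster: cluster 0 is at distance 1 from
# clusters 1, 2, 3, which are mutually at distance 2.
D = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 2.0, 2.0],
    [1.0, 2.0, 0.0, 2.0],
    [1.0, 2.0, 2.0, 0.0],
])

# Double-centered Gram matrix used by classical MDS.
n = len(D)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J

# A negative eigenvalue rules out an isometric Euclidean embedding.
min_eig = np.linalg.eigvalsh(B).min()
print("smallest eigenvalue:", min_eig)

# Symmetric planar layout: cluster 0 at the origin, clusters 1, 2, 3
# at angles 0, 120 and 240 degrees on the unit circle.
angles = 2 * np.pi * np.arange(3) / 3
outer = np.column_stack([np.cos(angles), np.sin(angles)])
ratio = np.linalg.norm(outer[0] - outer[1]) / np.linalg.norm(outer[0])
print("d(1,2) / d(0,1) in the plane:", ratio)
```

The smallest eigenvalue is negative, so no Euclidean space of any dimension realizes these distances exactly, and the planar ratio equals $\sqrt{3}$ rather than the original $2$.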

Theorems & Definitions (10)

  • Definition 1: Image of a cluster
  • Proposition 1: Clustering stability
  • proof
  • Definition 2: Image of a ClusterGraph vertex
  • Proposition 2: ClusterGraph stability
  • proof
  • Remark 1
  • Remark 2
  • Theorem 3: Theorem A in Bernstein2000GraphAT
  • Remark 3