Khatri-Rao Clustering for Data Summarization

Martino Ciaperoni, Collin Leiber, Aristides Gionis, Heikki Mannila

TL;DR

This work introduces the Khatri-Rao k-Means algorithm and the Khatri-Rao deep clustering framework, and shows that Khatri-Rao k-Means can strike a more favorable trade-off between succinctness and accuracy in data summarization than standard k-Means.

Abstract

As datasets continue to grow in size and complexity, finding succinct yet accurate data summaries poses a key challenge. Centroid-based clustering, a widely adopted approach to address this challenge, finds informative summaries of datasets in terms of a few prototypes, each representing a cluster in the data. Despite the wide adoption of this approach, the resulting data summaries often contain redundancies, which limits their effectiveness, particularly in datasets characterized by a large number of underlying clusters. To overcome this limitation, we introduce the Khatri-Rao clustering paradigm, which extends traditional centroid-based clustering to produce more succinct but equally accurate data summaries by postulating that centroids arise from the interaction of two or more succinct sets of protocentroids. We study two central approaches to centroid-based clustering, namely the well-established k-Means algorithm and the increasingly popular topic of deep clustering, through the lens of the Khatri-Rao paradigm. To this end, we introduce the Khatri-Rao k-Means algorithm and the Khatri-Rao deep clustering framework. Extensive experiments show that Khatri-Rao k-Means can strike a more favorable trade-off between succinctness and accuracy in data summarization than standard k-Means. Leveraging representation learning, the Khatri-Rao deep clustering framework offers even greater benefits, further reducing the size of the data summaries given by deep clustering while preserving their accuracy.
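To make the idea of centroids arising from interacting protocentroids concrete, here is a minimal sketch in plain Python. It assumes that the interaction is an element-wise sum or element-wise product of one vector from each set, and that every pair of protocentroids yields one centroid. The function name `combine` and the toy vectors are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from itertools import product

def combine(protos_a, protos_b, op="+"):
    """Build all centroids from two sets of protocentroids.

    Assumption: each centroid is the element-wise sum (op="+") or
    element-wise product (op="*") of one vector from each set.
    """
    if op == "+":
        agg = lambda u, v: u + v
    elif op == "*":
        agg = lambda u, v: u * v
    else:
        raise ValueError("op must be '+' or '*'")
    # One centroid per (a, b) pair, in row-major order over the two sets.
    return [
        [agg(u, v) for u, v in zip(a, b)]
        for a, b in product(protos_a, protos_b)
    ]

# Two hypothetical sets of 3 two-dimensional protocentroids.
set_a = [[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]]
set_b = [[0.0, 0.0], [0.0, 10.0], [0.0, 20.0]]

centroids = combine(set_a, set_b, op="+")
stored = (len(set_a) + len(set_b)) * 2   # numbers kept in the summary
explicit = len(centroids) * 2            # numbers for 9 explicit centroids
print(len(centroids), stored, explicit)  # 9 12 18
```

In this toy setting, 6 stored vectors describe 9 centroids, and the gap widens as the sets grow: two sets of size h give h² centroids from 2h stored vectors.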

Paper Structure (19 sections, 3 theorems, 16 equations, 10 figures, 3 tables, 1 algorithm)

Key Result

Proposition 1

The optimal updates of the $j$-th protocentroid in the first and second set of protocentroids at any iteration of Khatri-Rao-$k$-Means take one form if $\oplus = \times$ and another if $\oplus = +$.
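The proposition's own expressions are not reproduced here, so the following is not a transcription of them. It is a hedged sketch of what such an update could look like, assuming a squared Euclidean objective, fixed point-to-centroid assignments, a fixed second set of protocentroids, and element-wise aggregation. Under those assumptions, ordinary least squares gives a mean of residuals in the additive case and a per-coordinate regression in the multiplicative case. The function `update_first_set` and its parameters are hypothetical names.

```python
def update_first_set(points, assign, protos_b, n_a, op="+"):
    """Least-squares update of the first set with the second set fixed.

    points:   list of d-dimensional points
    assign:   list of (i, j) index pairs, one per point, giving the
              protocentroid of each set that the point is assigned to
    protos_b: current second set of protocentroids
    n_a:      number of protocentroids in the first set
    op:       "+" for additive, anything else for multiplicative
    """
    d = len(points[0])
    num = [[0.0] * d for _ in range(n_a)]
    den = [[0.0] * d for _ in range(n_a)]
    for x, (i, j) in zip(points, assign):
        b = protos_b[j]
        for t in range(d):
            if op == "+":
                num[i][t] += x[t] - b[t]   # residual after removing b
                den[i][t] += 1.0           # number of assigned points
            else:
                num[i][t] += x[t] * b[t]   # per-coordinate regression
                den[i][t] += b[t] * b[t]
    # Entries with a zero denominator have no support and are left as None.
    return [
        [n / m for n, m in zip(num[i], den[i])] if all(den[i]) else None
        for i in range(n_a)
    ]

# Noise-free toy data generated from hypothetical protocentroids.
true_a = [[1.0, 2.0], [3.0, 4.0]]
fixed_b = [[1.0, 2.0], [2.0, 1.0]]
pairs = [(i, j) for i in range(2) for j in range(2)]
add_pts = [[a + b for a, b in zip(true_a[i], fixed_b[j])] for i, j in pairs]
mul_pts = [[a * b for a, b in zip(true_a[i], fixed_b[j])] for i, j in pairs]

print(update_first_set(add_pts, pairs, fixed_b, 2, op="+"))
print(update_first_set(mul_pts, pairs, fixed_b, 2, op="*"))
```

On noise-free data both calls recover the protocentroids used to generate the points. The update for the second set is symmetric, with the roles of the two sets swapped.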

Figures (10)

  • Figure 1: $\textsc{stickfigures}$ dataset. Example of two sets of $3$ protocentroids interacting additively to generate $9$ centroids.
  • Figure 2: For a synthetic (Blobs) and a real ($\textsc{optdigits}$) dataset, we show relative percentage changes in unsupervised clustering accuracy and parameter count for clustering solutions from algorithms based on the Khatri-Rao paradigm relative to the baseline algorithms $k$-Means, Deep-k-Means (DKM) and Improved Deep Embedded Clustering (IDEC).
  • Figure 3: Diagram showing the interactions of two sets of protocentroids to generate cluster centroids.
  • Figure 4: Khatri-Rao-based (top) and arbitrarily-structured (bottom) synthetic data. Combining additively ($\oplus=+$) or multiplicatively ($\oplus=\times$) the first (red triangles) and second (blue triangles) sets of protocentroids yields the cluster centroids (purple diamonds). Gray lines indicate which protocentroid affects each cluster centroid.
  • Figure 5: Diagram summarizing the Khatri-Rao clustering paradigm in the $k$-Means (a) and deep clustering (b) settings; in this example, $p = 3$, $n_l=2$ and $q=2$. The centroid in red is obtained by aggregating the protocentroids in red.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Proposition 3