KRAFTY: Khatri-Rao Framework for Joint Cluster Recovery

Siyi Gao; Zachary Lubberts; Marianna Pensky

TL;DR

This paper introduces a multi-view clustering model and a method for recovering its joint clusters: the transposed Khatri-RAo Framework for joinT cluster recoverY (KRAFTY). The model is flexible and can accommodate a variety of data-generating processes, including latent positions in random dot product graphs and Gaussian mixtures.

Abstract

When multiple datasets describe complementary information about the same set of entities, for example, brain scans of an individual over time, global trade networks across years, or user information across social media platforms, integrating these snapshots allows us to see a more holistic picture. A common way of identifying structure in data is through clustering, but while clustering may be applied to each dataset separately, we learn more in the multi-view setting by identifying joint clusters. We consider a clustering problem where each view conflates some of these joint clusters, revealing only partial information, and seek to recover the true joint cluster structure. We introduce this multi-view clustering model and a method for recovering it: the transposed Khatri-RAo Framework for joinT cluster recoverY (KRAFTY). The model is flexible and can accommodate a variety of data-generating processes, including latent positions in random dot product graphs and Gaussian mixtures. A key advantage of KRAFTY is that it represents joint clusters in a space of sufficient dimension that each joint cluster occupies an orthogonal subspace in the transposed Khatri-Rao matrix, which results in a sharp drop in the scree plot at the true number of joint clusters, enabling easy model selection. Our simulations show that when the number of joint clusters exceeds the sum of the numbers of clusters in each individual view, our method outperforms existing methods in both joint clustering accuracy and estimation of the number of joint clusters.
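The orthogonality claim in the abstract can be illustrated in an idealized, noiseless setting. The sketch below is not the paper's algorithm, which works with estimated matrices such as $\widehat{Z}^{(v)}$ or $\widehat{U}^{(v)}$; it assumes exact one-hot membership matrices for two views, and the helper name `transposed_khatri_rao` and the label vectors are hypothetical choices made for illustration. It forms the transposed Khatri-Rao (row-wise Kronecker) product and inspects its singular values, which play the role of the scree plot.

```python
import numpy as np


def transposed_khatri_rao(A, B):
    """Row-wise Kronecker product: row i of the result is kron(A[i], B[i])."""
    if A.shape[0] != B.shape[0]:
        raise ValueError("A and B must have the same number of rows")
    # Outer product of matching rows, flattened to length A.shape[1] * B.shape[1].
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)


# Hypothetical per-view cluster labels for n = 14 entities (K_1 = K_2 = 3).
# The label pairs form K = 7 distinct joint clusters, more than K_1 + K_2 = 6.
labels_1 = np.tile([0, 0, 1, 1, 2, 2, 0], 2)
labels_2 = np.tile([0, 1, 1, 2, 2, 0, 2], 2)

# Noiseless one-hot membership matrices, one per view.
Z1 = np.eye(3)[labels_1]
Z2 = np.eye(3)[labels_2]

KR = transposed_khatri_rao(Z1, Z2)  # shape (14, 9)

# Each row of KR is an indicator of a (view-1 cluster, view-2 cluster) pair,
# so rows from different joint clusters are orthogonal.
joint_id = KR.argmax(axis=1)
num_joint = np.unique(joint_id).size

# Singular values of KR: nonzero for occupied joint clusters, zero otherwise.
singular_values = np.linalg.svd(KR, compute_uv=False)
rank_kr = np.linalg.matrix_rank(KR)

# Simple concatenation of the views has rank at most K_1 + K_2 - 1,
# because the columns of Z1 and of Z2 both sum to the all-ones vector.
rank_concat = np.linalg.matrix_rank(np.hstack([Z1, Z2]))

print("joint clusters found:", num_joint)
print("rank of KR:", rank_kr)
print("rank of [Z1, Z2]:", rank_concat)
print("singular values:", np.round(singular_values, 3))
```

In this toy setting the number of nonzero singular values of the product equals the number of occupied joint clusters, whereas the concatenated matrix cannot have rank above $K_1 + K_2 - 1$. With noisy, estimated inputs the zeros become small but nonzero values, which is where a drop in the scree plot would be looked for.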

Paper Structure (36 sections, 9 theorems, 119 equations, 38 figures, 3 algorithms)

Introduction
Background and Motivation
Notations
The Model
Related Work
Methodology
KRAFTY Joint Clustering Methodology
Joint Clustering Algorithms
A Common Data-Driven Example
Theoretical Results
Assumptions
Accuracy of Joint Cluster Recovery
Estimating the Number of Joint Clusters
Simulation Results
Simulation Design
...and 21 more sections

Key Result

lemma 1

Consider the model (eq:k-means-data) with $G^{(v)}$ given by (eq:k-means-mean). Let $d_v \geq K_v$, and assume that the ratio between the smallest and the largest nonzero eigenvalues of $\mathcal{M}^{(v)}$ is bounded below by a constant. Suppose the elements of the matrix $\Xi^{(v)}$ in (eq:k-means-data) are independent sub-gaussian or sub-exponential random variables, so that, for any $t>0$ and some absolute cons

Figures (38)

Figure 1: Graphical illustration of the intuition behind the transposed Khatri-Rao product.
Figure 2: Graphical representation for a common data-driven example.
Figure 3: Joint clustering performance of KRAFTY and MASE using $\widehat{Z}^{(v)}$ or $\widehat{U}^{(v)}$ matrices, evaluated under known and unknown $K$. ARI measures agreement between estimated and true joint clusters. $K_1 = K_2 = 4, \sigma^2=0.1, p=20.$
Figure 4: Joint clustering performance of the $\widehat{Z}^{(v)}$ and $\widehat{U}^{(v)}$ matrices using KR and MASE under unknown $K$ setting. The true number of joint clusters $K$ is held constant at 4, 9, and 15 across experiments, with $\sigma^2$ varying. ARI compares estimated joint clusters with ground truth. Here, $K_1 = K_2 = 4, p=20.$
Figure 5: Comparison of joint cluster number estimation performance between KR and MASE across $\widehat{Z}^{(v)}$ and $\widehat{U}^{(v)}$ matrices, when varying the number of true clusters $K$ and noise level $\sigma^2$. Heatmap colors represent the difference in mean absolute error (averaged over 100 repetitions) when estimating $K$ using MASE versus KR. The width of the largest credible interval for the $\widehat{U}^{(v)}$ matrices ($=2(1.96)SE$) is 1.06; for the $\widehat{Z}^{(v)}$ matrices it is 1.15. Cool colors $\to$ MASE better; warm colors $\to$ KR better. $K_1 = K_2 = 4, p=20.$
...and 33 more figures

Theorems & Definitions (10)

remark 1
lemma 1: Incoherence of matrix $\widehat{U}^{(v)}$
theorem 1: Accuracy of joint cluster recovery on the basis of $\widehat{Z}^{(v)}$
theorem 2: Consistent recovery of joint clusters on the basis of $\widehat{U}^{(v)}$
theorem 3: Perfect clustering
theorem 4: Perfect clustering with HC
theorem 5: Consistent estimation of the number of joint clusters on the basis of $\widehat{Z}^{(v)}$
theorem 6: Consistent estimation of the number of joint clusters on the basis of $\widehat{U}^{(v)}$
theorem 7: Consistent estimation of the number of joint clusters with HC
lemma 2
