Spectral Clustering for Discrete Distributions

Zixiao Wang, Dong Qiao, Jicong Fan

TL;DR

The paper addresses clustering discrete distributions by moving beyond Wasserstein barycenter-based centroids to a connectivity-based approach using spectral clustering. It introduces DDSC, which builds an affinity graph from distribution distances (MMD, Wasserstein, or Sinkhorn) and uses sparsified Gaussian kernels with normalized cuts, optionally enhanced by Linear Optimal Transport for scalability. The authors provide consistency and correctness guarantees, backed by Davis–Kahan perturbation analysis, and demonstrate improved clustering accuracy and efficiency on synthetic and real-world text and image datasets. The approach is robust to incomplete distance matrices and scalable to large collections of distributions, making it suitable for complex structured data like bags-of-words and histograms. Overall, DDSC offers a principled, scalable framework for clustering discrete distributions with strong theoretical and empirical support.
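
To make the affinity construction concrete, one standard way to instantiate the MMD variant is written out here. The uniform weights on support points, the kernel $k$, and the bandwidth $\sigma$ are assumptions for illustration; the paper's exact normalization and sparsification rule may differ. For two discrete distributions with support points $\{x_a\}_{a=1}^{m}$ and $\{y_b\}_{b=1}^{n}$:

$$
\mathrm{MMD}^2(X, Y) = \frac{1}{m^2}\sum_{a=1}^{m}\sum_{a'=1}^{m} k(x_a, x_{a'}) + \frac{1}{n^2}\sum_{b=1}^{n}\sum_{b'=1}^{n} k(y_b, y_{b'}) - \frac{2}{mn}\sum_{a=1}^{m}\sum_{b=1}^{n} k(x_a, y_b)
$$

With $D_{ij}$ the resulting distance between the $i$-th and $j$-th distributions, a Gaussian affinity takes the form

$$
A_{ij} = \exp\left(-\frac{D_{ij}^2}{2\sigma^2}\right),
$$

and sparsification sets $A_{ij} = 0$ for pairs that are not retained as neighbors in the graph.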

Abstract

The discrete distribution is often used to describe complex instances in machine learning, such as images, sequences, and documents. Traditionally, clustering of discrete distributions (D2C) has been approached using Wasserstein barycenter methods. These methods operate under the assumption that clusters can be well-represented by barycenters, which is seldom true in many real-world applications. Additionally, these methods are not scalable for large datasets due to the high computational cost of calculating Wasserstein barycenters. In this work, we explore the feasibility of using spectral clustering combined with distribution affinity measures (e.g., maximum mean discrepancy and Wasserstein distance) to cluster discrete distributions. We demonstrate that these methods can be more accurate and efficient than barycenter methods. To further enhance scalability, we propose using linear optimal transport to construct affinity matrices efficiently for large datasets. We provide theoretical guarantees for the success of our methods in clustering distributions. Experiments on both synthetic and real data show that our methods outperform existing baselines.
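
The pipeline described in the abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mmd2`, `sparse_affinity`, `bipartition`), the Gaussian kernel bandwidths, the k-nearest-neighbor sparsification, and the toy data are all hypothetical choices. For brevity it splits the data into two clusters by the sign of the second eigenvector of the normalized affinity (found by power iteration) rather than running k-means on several eigenvectors.

```python
import math
import random

def gauss_kernel(x, y, gamma):
    """Gaussian kernel between two support points (tuples of floats)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(X, Y, gamma=0.5):
    """Biased squared MMD between two uniformly weighted point sets."""
    def mean_k(P, Q):
        total = sum(gauss_kernel(p, q, gamma) for p in P for q in Q)
        return total / (len(P) * len(Q))
    return max(mean_k(X, X) + mean_k(Y, Y) - 2.0 * mean_k(X, Y), 0.0)

def sparse_affinity(D2, sigma=0.5, k=4):
    """Gaussian affinity kept only on the symmetrized k-nearest-neighbor graph."""
    n = len(D2)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: D2[i][j])[:k]
        for j in nearest:
            # D2 holds squared distances, so no extra squaring here.
            A[i][j] = A[j][i] = math.exp(-D2[i][j] / (2.0 * sigma ** 2))
    return A

def bipartition(A, iters=300, seed=0):
    """Two-way normalized-cut relaxation via power iteration.

    Finds the second leading eigenvector of S = D^{-1/2} A D^{-1/2}
    (iterating on S + I and deflating the known leading eigenvector),
    then labels each node by the sign of its rescaled entry.
    """
    n = len(A)
    s = [math.sqrt(sum(row)) for row in A]        # square roots of degrees
    s_norm = math.sqrt(sum(v * v for v in s))
    top = [v / s_norm for v in s]                 # leading eigenvector of S
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]  # deflate leading direction
        v = [v[i] + sum(A[i][j] * v[j] / (s[i] * s[j]) for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return [0 if v[i] / s[i] < 0 else 1 for i in range(n)]

# Toy data (assumed): 8 distributions around (0, 0) and 8 around (3, 3),
# each represented by 15 support points in the plane.
rng = random.Random(0)

def sample(center, m=15, std=0.5):
    return [(rng.gauss(center[0], std), rng.gauss(center[1], std))
            for _ in range(m)]

dists = [sample((0.0, 0.0)) for _ in range(8)] + \
        [sample((3.0, 3.0)) for _ in range(8)]
n = len(dists)
D2 = [[mmd2(dists[i], dists[j]) for j in range(n)] for i in range(n)]
A = sparse_affinity(D2)
labels = bipartition(A)
print(labels)
```

Which group receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. Note that no barycenter is ever computed: the clustering depends only on pairwise distances, which is what allows other measures such as Wasserstein or Sinkhorn to be swapped in for `mmd2`.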

Paper Structure

This paper contains 25 sections, 6 theorems, 50 equations, 3 figures, 7 tables, 2 algorithms.

Key Result

Lemma 1

Assume $D_{ij}$ is the $ij$-th entry of $\mathbf{D}$, then with probability at least $1 - \theta$ where $\kappa = 2L|\Omega| + \|c\|_\infty$, $\rho = 6B\frac{\eta\psi}{\sqrt{m}}$, and $E = \sqrt{\frac{2}{m}}\left(\kappa + \varepsilon\exp\left(\frac{\kappa}{\varepsilon}\right)\right)$.

Figures (3)

  • Figure 1: An intuitive comparison between distance-based (left plot) clustering and connectivity-based (right plot) clustering of distributions. Ellipses and rectangles denote distributions. The two clusters are marked in different colors.
  • Figure 2: Visualization of synthetic dataset clustering. Different colors indicate the cluster labels produced by the corresponding algorithm. There are 20 square-shaped and 20 circle-shaped distributions. Each distribution has 40 support points in $\mathbb{R}^2$.
  • Figure 3: Relative error of distance matrices constructed with different metrics.

Theorems & Definitions (14)

  • Definition 1: Connectivity-based distribution clustering
  • Definition 2: Connectivity-based discrete distribution clustering
  • Lemma 1: Error bound of sampling-based Sinkhorn divergence
  • Definition 3: Consistency of Clustering
  • Theorem 1: Consistency of DDSC$_{\text{Sinkhorn}}$
  • Definition 4: Intra-class neighbor set
  • Definition 5: Inter-class neighbor set
  • Definition 6: Correctness of Clustering
  • Theorem 2: Correctness of DDSC$_{\text{Sinkhorn}}$
  • Lemma 2: Error bound of similarity matrix
  • ...and 4 more