A Distribution Testing Approach to Clustering Distributions

Gunjan Kumar, Yash Pote, Jonathan Scarlett

TL;DR

This work tackles the problem of clustering a set of distributions into two groups, where the members of each group are identical and distributions from different groups are ε-far in total variation. It introduces a two-stage, distribution-testing-based approach that first finds an exemplar distribution from each cluster and then classifies the remaining distributions, achieving upper and lower bounds on sample complexity that are tight (up to a log k factor) across regimes defined by n, k, r, and ε. The results distinguish between the case where one cluster's distribution is known and the case where both are unknown, employing tools from distribution testing (identity/equivalence/uniformity), likelihood-free hypothesis testing (LFHT), and unequal-sample testing to derive both the algorithms and the corresponding lower bounds. The findings highlight the fundamental role of the cluster size r and show that adaptive two-stage strategies are nearly optimal for finite-sample, constant-error clustering of distributions. They also connect distribution testing techniques to clustering problems in unsupervised learning and bandit-inspired settings.

Abstract

We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when the distribution of one of the clusters is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, number of distributions $k$, size $r$ of one of the clusters, and distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) for all regimes.
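To make the problem setup concrete, the following is a minimal Python sketch of a naive two-stage plug-in clusterer: it picks exemplars by empirical total variation distance and then assigns each distribution to the nearer exemplar. This is not the paper's algorithm and does not achieve its sample complexity; the paper's procedures rely on distribution testers rather than plug-in estimates. All names here (`empirical`, `tv`, `make_sampler`, `cluster_two_stage`) and the sample sizes are hypothetical choices for illustration, and the sketch assumes both clusters are nonempty.

```python
import random
from collections import Counter


def empirical(samples, n):
    """Empirical distribution over the domain {0, ..., n-1}."""
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(x, 0) / m for x in range(n)]


def tv(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def make_sampler(dists, seed=0):
    """Return sample_fn(i, m) drawing m i.i.d. samples from dists[i] (illustrative helper)."""
    rng = random.Random(seed)

    def sample_fn(i, m):
        return rng.choices(range(len(dists[i])), weights=dists[i], k=m)

    return sample_fn


def cluster_two_stage(sample_fn, k, n, m_exemplar, m_classify):
    """Naive plug-in clustering; label 0 means 'same cluster as distribution 0'."""
    # Stage 1: distribution 0 is the exemplar of its own cluster; the
    # distribution whose empirical estimate is farthest from it is taken
    # as the exemplar of the other cluster.
    ref = [empirical(sample_fn(i, m_exemplar), n) for i in range(k)]
    far = max(range(1, k), key=lambda i: tv(ref[0], ref[i]))

    # Stage 2: classify each distribution by the nearer exemplar,
    # using fresh samples.
    labels = []
    for i in range(k):
        est = empirical(sample_fn(i, m_classify), n)
        labels.append(0 if tv(est, ref[0]) <= tv(est, ref[far]) else 1)
    return labels


# Toy instance: k = 6 distributions over a domain of size n = 4,
# with the two cluster distributions at total variation distance 0.45.
P = [0.25, 0.25, 0.25, 0.25]
Q = [0.70, 0.10, 0.10, 0.10]
truth = [0, 1, 0, 0, 1, 0]
dists = [P if t == 0 else Q for t in truth]

labels = cluster_two_stage(make_sampler(dists), k=6, n=4,
                           m_exemplar=2000, m_classify=2000)
```

In this toy instance the gap between the clusters is large and the sample sizes are generous, so the plug-in rule separates them easily. The regimes studied in the paper are those where the domain size $n$ is too large for each distribution to be estimated accurately, which is where testing-based procedures are needed.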

Paper Structure

This paper contains 57 sections, 23 theorems, 30 equations, 2 tables, 2 algorithms.

Key Result

Theorem 3.1

Let $P$ and $Q$ be two distributions with $\mathrm{d_{TV}}(P, Q) \gtrsim \varepsilon$. Given sample access to $P$ and $Q$, there exists a procedure (which we refer to as $\mathtt{MultiLFHT}$) that, with probability at least $\frac{8}{9}$, correctly classifies distributions $\{D_i\}_{i = 1}^k$ as $P$

Theorems & Definitions (38)

  • Theorem 3.1
  • proof
  • Lemma 3.1
  • Lemma 3.2
  • Lemma 3.3
  • proof
  • Corollary 3.1
  • proof
  • Theorem 3.2
  • Theorem 4.1
  • ...and 28 more