Almost Asymptotically Optimal Active Clustering Through Pairwise Observations

Rachel S. Y. Teo, P. N. Karthik, Ramya Korlakai Vinayak, Vincent Y. F. Tan

TL;DR

This work studies active clustering with noisy pairwise feedback in a fixed-confidence setting, where the goal is to recover an unknown clustering of $M$ items using adaptively queried item pairs. The authors derive an instance-dependent lower bound on the required number of queries via a sup-inf KL-divergence framework and design an asymptotically optimal algorithm that uses a GLR-based stopping rule and a D-tracking-inspired sampling strategy. To make the approach practical, they introduce A^3CNP, a computationally feasible variant with a closed-form stopping criterion that preserves near-optimal performance, and they quantify the suboptimality gap with a data-driven proxy. Theoretical results are complemented by experiments showing that A^3CNP achieves substantially faster stopping times than prior methods while maintaining provable guarantees. The framework lays a foundation for adaptive structure discovery in clustering, with potential extensions to richer feedback and scalable large-scale problems.
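
Because the feedback is binary, the KL divergences in the sup-inf framework mentioned above are divergences between Bernoulli distributions. The following is a minimal sketch of that building block only, not the paper's bound; the function name `bernoulli_kl` and its argument names are illustrative assumptions.

```python
import math


def bernoulli_kl(a, b):
    """KL divergence KL(Bernoulli(a) || Bernoulli(b)) in nats.

    Hypothetical helper for illustration: `a` and `b` are the success
    probabilities of the two Bernoulli distributions being compared.
    """
    if not (0.0 <= a <= 1.0):
        raise ValueError("a must lie in [0, 1]")
    if not (0.0 < b < 1.0):
        raise ValueError("b must lie strictly between 0 and 1")
    total = 0.0
    # Use the convention 0 * log(0 / x) = 0 at the boundary values of a.
    if a > 0.0:
        total += a * math.log(a / b)
    if a < 1.0:
        total += (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return total
```

For example, `bernoulli_kl(0.8, 0.2)` measures how distinguishable a same-cluster response distribution with $p = 0.8$ is from a different-cluster one with $q = 0.2$; the divergence shrinks to zero as the two parameters approach each other, which is when more queries are needed to tell the two cases apart.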

Abstract

We propose a new analysis framework for clustering $M$ items into an unknown number of $K$ distinct groups using noisy and actively collected responses. At each time step, an agent is allowed to query pairs of items and observe bandit binary feedback. If the pair of items belongs to the same (resp. different) cluster, the observed feedback is $1$ with probability $p>1/2$ (resp. $q<1/2$). Leveraging the ubiquitous change-of-measure technique, we establish a fundamental lower bound on the expected number of queries needed to achieve a desired confidence in the clustering accuracy, formulated as a sup-inf optimization problem. Building on this theoretical foundation, we design an asymptotically optimal algorithm in which the stopping criterion involves an empirical version of the inner infimum -- the Generalized Likelihood Ratio (GLR) statistic -- being compared to a threshold. We develop a computationally feasible variant of the GLR statistic and show that its performance gap to the lower bound can be accurately empirically estimated and remains within a constant multiple of the lower bound.
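
The feedback model in the abstract can be simulated in a few lines. The sketch below is illustrative only and is not the authors' algorithm: the names `make_oracle` and `empirical_means`, the default values $p = 0.8$ and $q = 0.2$, and the uniform (non-adaptive) querying of every pair are all assumptions made for the example.

```python
import random


def make_oracle(labels, p=0.8, q=0.2, seed=0):
    """Return a query function for the noisy pairwise feedback model.

    labels[i] is the hidden cluster index of item i. A query on the pair
    (i, j) returns 1 with probability p if the two items share a cluster
    and with probability q otherwise, where p > 1/2 > q.
    """
    if not (p > 0.5 > q):
        raise ValueError("the model requires p > 1/2 > q")
    rng = random.Random(seed)  # seeded for reproducibility

    def query(i, j):
        if i == j:
            raise ValueError("a query must involve two distinct items")
        prob_one = p if labels[i] == labels[j] else q
        return 1 if rng.random() < prob_one else 0

    return query


def empirical_means(labels, num_queries=2000, p=0.8, q=0.2, seed=0):
    """Query every pair num_queries times and return the mean response."""
    query = make_oracle(labels, p, q, seed)
    num_items = len(labels)
    means = {}
    for i in range(num_items):
        for j in range(i + 1, num_items):
            responses = [query(i, j) for _ in range(num_queries)]
            means[(i, j)] = sum(responses) / num_queries
    return means
```

With `labels = [0, 0, 1]`, the empirical mean for the pair `(0, 1)` concentrates near $p$ and the means for `(0, 2)` and `(1, 2)` concentrate near $q$, so thresholding at $1/2$ recovers the clustering. This uniform scheme spends the same budget on every pair; the active setting studied in the paper instead chooses which pair to query adaptively and decides when to stop.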

Paper Structure (22 sections, 14 theorems, 145 equations, 1 figure, 1 algorithm)

Key Result

Theorem 1

For a confidence level $\delta \in (0,1)$ and instance $C\in \mathcal{C}$, any $\delta$-correct algorithm satisfies where and $v'_{ij}$ is the distribution for $C'$ analogous to eq:feedback (with corresponding parameters $p' > 1/2 > q'$). Furthermore,

Figures (1)

  • Figure 1: The asymptotic ($\delta \to 0$) sample complexity of $\mathrm{A}^3\mathrm{CNP}$, with varying $\epsilon$ (first argument) and $\sigma$ (second argument) values, relative to the active clustering algorithm of Chen, Vinayak, and Hassibi (2023). Also included in the plot are the theoretical lower bound \eqref{eq:lower-bound}, the upper bound \eqref{eq:sigma_upper-bound-A3CNP}, and the data-dependent proxy $\widehat{\rm SG}_\epsilon(\sigma; C_{\tau_\delta})$ in \eqref{eqn:data_dep}, evaluated at the stopping time $\tau_\delta$ using $\epsilon=10^{-1}$ and $\sigma=10^{-3}$.

Theorems & Definitions (29)

  • Theorem 1
  • Remark 1
  • Theorem 2
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 3
  • Corollary 1
  • Theorem 4
  • proof
  • ...and 19 more