Almost Asymptotically Optimal Active Clustering Through Pairwise Observations
Rachel S. Y. Teo, P. N. Karthik, Ramya Korlakai Vinayak, Vincent Y. F. Tan
TL;DR
This work studies active clustering with noisy pairwise feedback in a fixed-confidence setting, where the goal is to recover an unknown clustering of M items from adaptively queried item pairs. The authors derive an instance-dependent lower bound on the expected number of queries, expressed as a sup-inf optimization over KL divergences, and design an asymptotically optimal algorithm that combines a GLR-based stopping rule with a D-tracking-inspired sampling strategy. To make the approach practical, they introduce A^3CNP, a computationally feasible variant with a closed-form stopping criterion that preserves near-optimal performance, and they quantify its suboptimality gap with a data-driven proxy. Theoretical results are complemented by experiments showing that A^3CNP achieves substantially faster stopping times than prior methods while maintaining provable guarantees. The framework lays a foundation for adaptive structure discovery in clustering, with potential extensions to richer feedback and large-scale problems.
Abstract
We propose a new analysis framework for clustering $M$ items into an unknown number $K$ of distinct groups using noisy and actively collected responses. At each time step, an agent is allowed to query a pair of items and observe binary bandit feedback. If the two items belong to the same (resp.\ different) cluster, the observed feedback is $1$ with probability $p>1/2$ (resp.\ $q<1/2$). Leveraging the ubiquitous change-of-measure technique, we establish a fundamental lower bound on the expected number of queries needed to achieve a desired confidence in the clustering accuracy, formulated as a sup-inf optimization problem. Building on this theoretical foundation, we design an asymptotically optimal algorithm whose stopping criterion compares an empirical version of the inner infimum -- the Generalized Likelihood Ratio (GLR) statistic -- to a threshold. We develop a computationally feasible variant of the GLR statistic and show that its performance gap to the lower bound can be estimated accurately from data and remains within a constant multiple of the lower bound.
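The observation model described above can be illustrated with a minimal simulation sketch. It is not the authors' algorithm: the names `query_pair` and `bernoulli_kl`, the example labels, and the values $p = 0.8$, $q = 0.3$ are assumptions chosen only for illustration. The sketch draws noisy responses for one same-cluster pair and one different-cluster pair, and computes the Bernoulli KL divergence, the kind of quantity that typically enters change-of-measure lower bounds.

```python
import math
import random


def query_pair(labels, i, j, p, q, rng):
    """Return one noisy binary response for the pair (i, j).

    The response is 1 with probability p if items i and j share a
    cluster, and with probability q otherwise (hypothetical helper).
    """
    same_cluster = labels[i] == labels[j]
    prob_one = p if same_cluster else q
    return 1 if rng.random() < prob_one else 0


def bernoulli_kl(a, b):
    """KL divergence KL(Ber(a) || Ber(b)) in nats, for a, b in (0, 1)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


# Illustrative instance: M = 5 items in K = 3 clusters (assumed values).
labels = [0, 0, 1, 1, 2]
p, q = 0.8, 0.3
rng = random.Random(0)  # fixed seed so the run is reproducible

n = 20000
# Items 0 and 1 share a cluster; items 0 and 2 do not.
same_rate = sum(query_pair(labels, 0, 1, p, q, rng) for _ in range(n)) / n
diff_rate = sum(query_pair(labels, 0, 2, p, q, rng) for _ in range(n)) / n

print(f"empirical rate, same cluster:      {same_rate:.3f}")
print(f"empirical rate, different cluster: {diff_rate:.3f}")
print(f"KL(Ber(p) || Ber(q)) = {bernoulli_kl(p, q):.4f} nats")
```

The empirical response rates concentrate near $p$ and $q$ respectively, which is what allows repeated queries of the same pair to reveal whether its items are co-clustered. The closer $p$ and $q$ are to each other, the smaller the KL divergence and the more queries are needed to tell the two cases apart.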
