Clustering with Non-adaptive Subset Queries
Hadley Black, Euiwoong Lee, Arya Mazumdar, Barna Saha
TL;DR
This work studies recovering a hidden $k$-clustering of $n$ items using non-adaptive subset queries, each of which reports how many clusters intersect the query set. By linking subset queries to combinatorial group testing and exploiting random-graph connectivity, the authors obtain near-linear non-adaptive algorithms for queries of unrestricted size, with separate bounds for size-bounded queries and balanced clusters. They show that $O(n\log k\cdot(\log k+\log\log n)^2)$ queries suffice in general (improving to $O(n\log\log n)$ for constant $k$) and prove a lower bound of $\Omega(\max(n^2/s^2,n))$ when query size is restricted to at most $s$. For balanced clusters, the non-adaptive algorithms make $O(n\log k)+\tilde{O}(k)$ and $O(n\log^2 k)$ queries, and allowing two rounds of adaptivity improves the bounds to $O(n\log k)$ (general) and $O(n\log\log k)$ (balanced). Overall, the paper achieves near-linear query complexities for non-adaptive clustering with subset queries across several regimes.
Abstract
Recovering the underlying clustering of a set $U$ of $n$ points by asking pair-wise same-cluster queries has garnered significant interest in the last decade. Given a query $S \subset U$, $|S|=2$, the oracle returns yes if the points are in the same cluster and no otherwise. For adaptive algorithms with pair-wise queries, the number of required queries is known to be $\Theta(nk)$, where $k$ is the number of clusters. However, non-adaptive schemes require $\Omega(n^2)$ queries, which matches the trivial $O(n^2)$ upper bound attained by querying every pair of points. To break the quadratic barrier for non-adaptive queries, we study a generalization of this problem to subset queries for $|S|>2$, where the oracle returns the number of clusters intersecting $S$. Allowing for subset queries of unbounded size, $O(n)$ queries suffice with an adaptive scheme (Chakrabarty-Liao, 2024). However, the realm of non-adaptive algorithms is completely unexplored. In this paper, we give the first non-adaptive algorithms for clustering with subset queries. Our main result is a non-adaptive algorithm making $O(n \log k \cdot (\log k + \log\log n)^2)$ queries, which improves to $O(n \log \log n)$ when $k$ is a constant. We also consider algorithms with a restricted query size of at most $s$. In this setting we prove that $\Omega(\max(n^2/s^2,n))$ queries are necessary and obtain algorithms making $\tilde{O}(n^2k/s^2)$ queries for any $s \leq \sqrt{n}$ and $\tilde{O}(n^2/s)$ queries for any $s \leq n$. We also consider the natural special case when the clusters are balanced, obtaining non-adaptive algorithms which make $O(n \log k) + \tilde{O}(k)$ and $O(n\log^2 k)$ queries. Finally, allowing two rounds of adaptivity, we give an algorithm making $O(n \log k)$ queries in the general case and $O(n \log \log k)$ queries when the clusters are balanced.
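To make the query model concrete, the following Python sketch simulates a subset-query oracle and the trivial non-adaptive baseline from the abstract, which queries every pair of points. It is an illustration only and not one of the paper's algorithms: the names `make_oracle`, `all_pairs_queries`, and `recover_from_pairs` and the toy label vector `hidden` are assumptions introduced here.

```python
from itertools import combinations


def make_oracle(labels):
    """Return a subset-query oracle for a hidden clustering (hypothetical helper).

    labels[i] is the hidden cluster id of point i; the oracle reports
    how many distinct clusters intersect the queried set S.
    """
    def oracle(S):
        return len({labels[i] for i in S})
    return oracle


def all_pairs_queries(n):
    """Non-adaptive query list: every pair, fixed before any answer is seen."""
    return [frozenset(p) for p in combinations(range(n), 2)]


def recover_from_pairs(n, queries, answers):
    """Rebuild the clustering from pair answers using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for S, count in zip(queries, answers):
        if count == 1:  # exactly one cluster meets the pair: same cluster
            a, b = S
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).add(i)
    return sorted(clusters.values(), key=min)


# Toy instance (assumed data): 6 points in 3 clusters.
hidden = [0, 1, 0, 2, 1, 0]
oracle = make_oracle(hidden)
queries = all_pairs_queries(len(hidden))
answers = [oracle(S) for S in queries]  # all queries are asked in one batch
clusters = recover_from_pairs(len(hidden), queries, answers)
```

The baseline issues $\binom{n}{2}$ queries (15 for the six points above), which is the quadratic cost that the larger subset queries studied in the paper are meant to avoid. The same `oracle` accepts sets of any size, for example `oracle({0, 1, 3})`.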
