Clustering with Non-adaptive Subset Queries

Hadley Black; Euiwoong Lee; Arya Mazumdar; Barna Saha

TL;DR

This work studies recovering a hidden $k$-clustering on $n$ items using non-adaptive subset queries that report how many clusters intersect a query set. By linking subset queries to combinatorial group testing and exploiting random-graph connectivity, the authors devise near-linear non-adaptive algorithms for unrestricted-size queries, with refined bounds for size-bounded and balanced scenarios. They show $O(n\log k\cdot(\log k+\log\log n)^2)$ queries suffice in general (improving to $O(n\log\log n)$ for constant $k$), and provide lower bounds $\Omega(\max(n^2/s^2,n))$ when restricting query size to $s$. Additional results cover balanced clusters, and two rounds of adaptivity yield further improvements to $O(n\log k)$ (general) and $O(n\log\log k)$ (balanced). Overall, the paper advances non-adaptive clustering with subset queries by achieving near-linear query complexities across several regimes and linking the problem to established combinatorial- and graph-theoretic techniques.

Abstract

Recovering the underlying clustering of a set $U$ of $n$ points by asking pair-wise same-cluster queries has garnered significant interest in the last decade. Given a query $S \subset U$, $|S|=2$, the oracle returns yes if the points are in the same cluster and no otherwise. For adaptive algorithms with pair-wise queries, the number of required queries is known to be $\Theta(nk)$, where $k$ is the number of clusters. However, non-adaptive schemes require $\Omega(n^2)$ queries, which matches the trivial $O(n^2)$ upper bound attained by querying every pair of points. To break the quadratic barrier for non-adaptive queries, we study a generalization of this problem to subset queries for $|S|>2$, where the oracle returns the number of clusters intersecting $S$. Allowing for subset queries of unbounded size, $O(n)$ queries are possible with an adaptive scheme (Chakrabarty-Liao, 2024). However, the realm of non-adaptive algorithms is completely unknown. In this paper, we give the first non-adaptive algorithms for clustering with subset queries. Our main result is a non-adaptive algorithm making $O(n \log k \cdot (\log k + \log\log n)^2)$ queries, which improves to $O(n \log \log n)$ when $k$ is a constant. We also consider algorithms with a restricted query size of at most $s$. In this setting we prove that $\Omega(\max(n^2/s^2,n))$ queries are necessary and obtain algorithms making $\tilde{O}(n^2k/s^2)$ queries for any $s \leq \sqrt{n}$ and $\tilde{O}(n^2/s)$ queries for any $s \leq n$. We also consider the natural special case when the clusters are balanced, obtaining non-adaptive algorithms which make $O(n \log k) + \tilde{O}(k)$ and $O(n\log^2 k)$ queries. Finally, allowing two rounds of adaptivity, we give an algorithm making $O(n \log k)$ queries in the general case and $O(n \log \log k)$ queries when the clusters are balanced.
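To make the query model concrete, the following is a minimal Python sketch of the subset-query oracle described above, together with the trivial non-adaptive baseline that queries every pair. It is an illustration only and does not reproduce the paper's algorithms. The names `make_oracle`, `pairwise_baseline`, and `shares_cluster_with` are hypothetical. The last helper illustrates one elementary fact behind the group-testing connection: for a point $x \notin S$, the answers on $S$ and $S \cup \{x\}$ coincide exactly when the cluster of $x$ intersects $S$, and both sets can be fixed in advance.

```python
from itertools import combinations

def make_oracle(labels):
    """Build a subset-query oracle for a hidden clustering (hypothetical helper).

    labels[i] is the hidden cluster id of point i. The oracle reveals only
    the number of distinct clusters that intersect the queried set.
    """
    def query(subset):
        return len({labels[i] for i in subset})
    return query

def pairwise_baseline(n, query):
    """Trivial non-adaptive scheme: query every pair, then merge.

    All n*(n-1)/2 queries are chosen before any answer is seen. A pair
    answers 1 exactly when its two points lie in the same cluster.
    """
    pairs = list(combinations(range(n), 2))   # fixed in advance
    answers = [query(p) for p in pairs]
    parent = list(range(n))                   # union-find over points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    for (a, b), ans in zip(pairs, answers):
        if ans == 1:
            parent[find(a)] = find(b)
    return [find(i) for i in range(n)], len(pairs)

def shares_cluster_with(x, subset, query):
    """Test whether x's cluster intersects subset (assumes x is not in subset)."""
    s = set(subset)
    return query(s | {x}) == query(s)

# Example with a hypothetical hidden clustering on 6 points.
hidden = [0, 0, 1, 2, 1, 0]
oracle = make_oracle(hidden)
recovered, num_queries = pairwise_baseline(len(hidden), oracle)
```

The baseline spends a number of queries quadratic in $n$, which is the cost the subset-query algorithms in the paper are designed to avoid.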

Paper Structure (40 sections, 35 theorems, 53 equations, 8 algorithms)

Introduction
Results
Algorithms with unrestricted query size.
The balanced case.
Allowing two rounds of adaptivity.
Organization.
Ideas and Techniques
A Connection with Combinatorial Group Testing
Obtaining a simple $\widetilde{O}(n)$ algorithm.
Discovering Small Clusters with Large Queries
Bounded query size.
Combining the Two Ideas for our Main Algorithm
Open Questions
Clustering with Subset Queries of Unbounded Size
An $O(n \log k\cdot (\log k + \log \log n)^2)$ Algorithm
...and 25 more sections

Key Result

Theorem 1.1

There is a randomized, non-adaptive $k$-clustering algorithm making $O(n \log k \cdot (\log k + \log \log n)^2)$ subset queries.
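As a rough sense of scale, the sketch below evaluates the expression inside the $O(\cdot)$ of Theorem 1.1 next to the number of pairs queried by the trivial scheme. The constant hidden by the $O(\cdot)$ is not given here, so setting it to 1 is purely an assumption and the numbers indicate growth rates only, not actual query counts. The function names are hypothetical, and base-2 logarithms are an arbitrary choice.

```python
import math

def pairwise_count(n):
    """Number of pairs queried by the trivial non-adaptive scheme."""
    return n * (n - 1) // 2

def main_bound_expression(n, k):
    """n * log k * (log k + log log n)^2 with the hidden constant set to 1.

    Assumes n >= 4 and k >= 2 so that every logarithm is positive.
    """
    log_k = math.log2(k)
    log_log_n = math.log2(math.log2(n))
    return n * log_k * (log_k + log_log_n) ** 2

n, k = 2 ** 16, 2
print(pairwise_count(n))             # 2147450880
print(main_bound_expression(n, k))   # 1638400.0
```

For these hypothetical values, $\log_2 k = 1$ and $\log_2\log_2 n = 4$, so the expression equals $25n$, against roughly $n^2/2$ pairs.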

Theorems & Definitions (60)

Theorem 1.1 (informal statement of thm:1)
Theorem 1.2 (informal statement of thm:nloglogn)
Theorem 1.3 (restatement of cor:3-s-LB)
Theorem 1.4 (informal statement of thm:bounded-2)
Theorem 1.5 (informal statement of thm:bounded-1)
Theorem 1.6 (informal statement of thm:k-bal-1 and thm:k-bal-2)
Theorem 1.7 (informal statement of thm:2-round and thm:2-round-bal)
Lemma 1.7
Corollary 1.8
proof
...and 50 more
