
Quantum (Inspired) $D^2$-sampling with Applications

Poojan Shah, Ragesh Jaiswal

TL;DR

A quantum algorithm for (approximate) $D^2$-sampling in the QRAM model results in a fast quantum-inspired classical implementation of $k$-means++, which is called QI-$k$-means++, with a running time $O(Nd) + \tilde{O}(\zeta^2k^2d)$, where the $O(Nd)$ term is for setting up the sample-query access data structure.

Abstract

$D^2$-sampling is a fundamental component of sampling-based clustering algorithms such as $k$-means++. Given a dataset $V \subset \mathbb{R}^d$ with $N$ points and a center set $C \subset \mathbb{R}^d$, $D^2$-sampling refers to picking a point from $V$ where the sampling probability of a point is proportional to its squared distance from the nearest center in $C$. Starting with an empty $C$ and iteratively $D^2$-sampling and updating $C$ over $k$ rounds is precisely $k$-means++ seeding, which runs in $O(Nkd)$ time and gives an $O(\log{k})$-approximation in expectation for the $k$-means problem. We give a quantum algorithm for (approximate) $D^2$-sampling in the QRAM model that results in a quantum implementation of $k$-means++ that runs in time $\tilde{O}(\zeta^2 k^2)$. Here $\zeta$ is the aspect ratio (i.e., the ratio of the largest to the smallest interpoint distance), and $\tilde{O}$ hides polylogarithmic factors in $N, d, k$. It can be shown through a robust approximation analysis of $k$-means++ that the quantum version preserves its $O(\log{k})$ approximation guarantee. Further, we show that our quantum algorithm for $D^2$-sampling can be 'dequantized' using the sample-query access model of Tang (PhD Thesis, Ewin Tang, University of Washington, 2023). This results in a fast quantum-inspired classical implementation of $k$-means++, which we call QI-$k$-means++, with a running time $O(Nd) + \tilde{O}(\zeta^2k^2d)$, where the $O(Nd)$ term is for setting up the sample-query access data structure. Experimental investigations show promising results for QI-$k$-means++ on large datasets with bounded aspect ratio. Finally, we use our quantum $D^2$-sampling with the known $D^2$-sampling-based classical approximation scheme (i.e., $(1+\varepsilon)$-approximation for any given $\varepsilon>0$) to obtain the first quantum approximation scheme for the $k$-means problem with polylogarithmic running time dependence on $N$.
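For orientation, the classical baseline that the abstract describes can be sketched as follows. This is a minimal illustration of exact $D^2$-sampling and $k$-means++ seeding in $O(Nkd)$ time, not the paper's quantum or quantum-inspired algorithm. The function names (`sq_dist`, `d2_sample`, `kmeanspp_seeding`) are hypothetical, and the sketch assumes the usual convention that the first center, drawn when $C$ is empty, is chosen uniformly at random.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample(min_sq_dists, rng):
    """Return an index i with probability proportional to min_sq_dists[i]."""
    if sum(min_sq_dists) == 0.0:
        # Every point coincides with a center; fall back to uniform sampling.
        return rng.randrange(len(min_sq_dists))
    return rng.choices(range(len(min_sq_dists)), weights=min_sq_dists, k=1)[0]


def kmeanspp_seeding(points, k, seed=0):
    """Classical k-means++ seeding: k rounds of D^2-sampling, O(N d) per round."""
    rng = random.Random(seed)
    # Assumed convention: with an empty center set, sample uniformly.
    centers = [points[rng.randrange(len(points))]]
    # min_sq[i] holds the squared distance from points[i] to its nearest center.
    min_sq = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        c = points[d2_sample(min_sq, rng)]
        centers.append(c)
        # Adding a center can only shrink each point's nearest-center distance.
        min_sq = [min(m, sq_dist(p, c)) for m, p in zip(min_sq, points)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (-10.0, 5.0)]
    print(kmeanspp_seeding(data, k=3, seed=1))
```

Each round touches all $N$ points in $d$ dimensions, which is where the $O(Nkd)$ total comes from; the quantum and quantum-inspired variants aim to avoid this per-round dependence on $N$.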

Paper Structure (28 sections, 42 theorems, 57 equations, 5 figures, 8 tables, 5 algorithms)


Key Result

Theorem 1

There is a quantum implementation of $k$-means++ that runs in time $\tilde{O}(\zeta^2 k^2)$ and gives an $O(\log{k})$ factor approximate solution for the $k$-means problem with a probability of at least $0.99$. Here, $\tilde{O}$ hides $\log^2{(Nd)}$ and $\log^2{(kd)}$ terms. The output of $k$-means++
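The bound in Theorem 1 depends on the aspect ratio $\zeta$, which the abstract defines as the ratio of the largest to the smallest interpoint distance. A brute-force sketch of that quantity is shown here to make the parameter concrete; the helper name `aspect_ratio` is hypothetical, and the $O(N^2 d)$ loop is only for illustration, not something the paper's algorithms would compute.

```python
import math
from itertools import combinations


def aspect_ratio(points):
    """Largest interpoint distance divided by the smallest (brute force)."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    d_min = min(dists)
    if d_min == 0.0:
        raise ValueError("duplicate points make the aspect ratio undefined")
    return max(dists) / d_min


if __name__ == "__main__":
    # Pairwise distances are 1, 3 and 4, so the ratio is 4 / 1.
    print(aspect_ratio([(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]))
```

Because the running time scales with $\zeta^2$, datasets with a few nearly coincident points can have a very large $\zeta$, which is consistent with the abstract's emphasis on datasets with bounded aspect ratio.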

Figures (5)

  • Figure 1: A tree data structure to enable sample-query access to an example vector of dimension $n = 4$. Index $i$ can be sampled with probability $\frac{|\vec{v}_i|^2}{\sum_j |\vec{v}_j|^2}$ in $O(\log{n})$ time by traversing down the tree.
  • Figure 2: Cumulative runtime plot for MNIST
  • Figure 3: Cumulative runtime plot for IRIS
  • Figure 4: Cumulative runtime plot for KDD
  • Figure 5: Cumulative runtime plot for SUSY
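The tree described in the Figure 1 caption can be sketched as a binary sum tree over squared entries. This is a minimal illustration under assumptions, not the paper's implementation: the class name `SQVector`, its method names, and the example vector are hypothetical, and the array layout (root at index 1, zero-padded leaves) is one common way to realize such a tree.

```python
import random


class SQVector:
    """Sample-query access to a vector via a binary tree of squared entries."""

    def __init__(self, v):
        self.v = list(v)
        self.size = 1
        while self.size < len(self.v):
            self.size *= 2  # pad the leaf level to a power of two
        # tree[1] is the root; leaves live at tree[size : 2 * size].
        self.tree = [0.0] * (2 * self.size)
        for i, x in enumerate(self.v):
            self.tree[self.size + i] = x * x
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def query(self, i):
        """Return entry v_i in O(1)."""
        return self.v[i]

    def norm_sq(self):
        """Return the squared norm, stored at the root."""
        return self.tree[1]

    def sample(self, rng=random):
        """Return index i with probability v_i^2 / norm_sq in O(log n)."""
        if self.tree[1] == 0.0:
            raise ValueError("cannot sample from the zero vector")
        node = 1
        while node < self.size:
            left = self.tree[2 * node]
            # Descend left with probability left / tree[node], else right.
            if rng.random() * self.tree[node] < left:
                node = 2 * node
            else:
                node = 2 * node + 1
        return node - self.size

    def update(self, i, x):
        """Set v_i = x and refresh the sums on the leaf-to-root path."""
        self.v[i] = x
        node = self.size + i
        self.tree[node] = x * x
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2


if __name__ == "__main__":
    sq = SQVector([1.0, -2.0, 0.0, 2.0])  # an assumed example with n = 4
    print(sq.norm_sq(), sq.sample(random.Random(0)))
```

Building the tree takes time linear in the vector length, which matches the role of the one-time $O(Nd)$ setup term when one such structure is built over the dataset; sampling and single-entry updates then cost $O(\log n)$ each.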

Theorems & Definitions (70)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Definition 1: Query access, Definition 1.1 in tang-thesis
  • Definition 2: SQ-access to a vector, Definition 1.2 in tang-thesis
  • Lemma 1: Remark 4.12 in tang-thesis
  • Lemma 2: kllp19 and wiebe
  • Lemma 3
  • Lemma 4
  • ...and 60 more