How many qubits does a machine learning problem require?

Sydney Leither, Michael Kubal, Sonika Johri

TL;DR

The paper tackles the fundamental question of how many qubits a learning problem requires in a variational quantum setting. It introduces bit-bit encoding as a universal, efficiently compressive data representation and a concrete resource metric, $Q_{\text{dataset}}(x)$, to estimate qubit needs for target accuracy. Through theory and experiments on MNIST and benchmark datasets, it shows medium-sized classical datasets typically require around $27$ qubits under bit-bit encoding, while larger biological datasets may demand more, especially when batched processing is used. The work provides a principled foundation for benchmarking quantum advantages in machine learning and guides future work toward datasets and encodings where quantum resources could yield a practical edge.

Abstract

For a machine learning paradigm to be generally applicable, it should have the property of universal approximation, that is, it should be able to approximate any target function to any desired degree of accuracy. In variational quantum machine learning, the class of functions that can be learned depends on both the data encoding scheme and the architecture of the optimizable part of the model. Here, we show that the property of universal approximation is constructively and efficiently realized by the recently proposed bit-bit data encoding scheme. Further, we show that this construction allows us to calculate the number of qubits required to solve a learning problem on a dataset to a target accuracy, giving rise to the first resource estimation framework for variational quantum machine learning. We apply bit-bit encoding to a number of medium-sized classical benchmark datasets and find that they require only 27 qubits on average for encoding. We extend the basic bit-bit encoding scheme to a variant that efficiently supports batched processing of large datasets. As a demonstration, we apply this new scheme to subsets of a giga-scale transcriptomic dataset. This work establishes bit-bit encoding not only as a universally expressive quantum data representation, but also as a practical foundation for resource estimation and benchmarking in quantum machine learning.
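To make the idea of counting qubits for a target accuracy concrete, here is a minimal sketch and not the paper's algorithm. It assumes that every feature is quantized uniformly to the same number of bits, that samples colliding on the same bitstring can at best all receive their majority label, and that the class label takes $\lceil \log_2(\text{number of classes}) \rceil$ qubits. The function names `quantize`, `best_accuracy`, and `qubits_for_accuracy` are hypothetical, and the paper's per-feature bit allocation may differ from the uniform one used here.

```python
from collections import Counter, defaultdict
from math import ceil, log2


def quantize(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer code with `bits` bits (uniform bins)."""
    if bits == 0 or hi == lo:
        return 0
    levels = 2 ** bits
    code = int((value - lo) / (hi - lo) * levels)
    return min(code, levels - 1)  # the maximum value falls into the last bin


def best_accuracy(X, y, bits_per_feature):
    """Best achievable training accuracy after reducing inputs to bitstrings.

    Assumption: samples that share a bitstring are indistinguishable, so the
    best any model can do is predict the majority label of each bitstring.
    """
    n_features = len(X[0])
    lows = [min(row[j] for row in X) for j in range(n_features)]
    highs = [max(row[j] for row in X) for j in range(n_features)]
    buckets = defaultdict(Counter)
    for row, label in zip(X, y):
        key = tuple(
            quantize(row[j], lows[j], highs[j], bits_per_feature)
            for j in range(n_features)
        )
        buckets[key][label] += 1
    correct = sum(max(counts.values()) for counts in buckets.values())
    return correct / len(y)


def qubits_for_accuracy(X, y, target, max_bits=16):
    """Smallest input-plus-class qubit count reaching `target` accuracy.

    Assumption: uniform bits per feature; returns None if `max_bits` is not enough.
    """
    n_classes = len(set(y))
    class_qubits = ceil(log2(n_classes)) if n_classes > 1 else 0
    for bits in range(max_bits + 1):
        if best_accuracy(X, y, bits) >= target:
            return bits * len(X[0]) + class_qubits
    return None


# Toy data (illustrative only, not from the paper).
X_toy = [[0.0], [0.1], [0.9], [1.0]]
y_toy = [0, 0, 1, 1]
q_toy = qubits_for_accuracy(X_toy, y_toy, target=1.0)
```

On the toy data, zero input bits put all four samples in one bucket, so at most half can be labeled correctly; one input bit separates the two classes, which together with one class qubit gives a total of two qubits. Under the same class-qubit assumption, a four-class problem would reserve two qubits for the label.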

Paper Structure

This paper contains 14 sections, 21 equations, 10 figures.

Figures (10)

  • Figure 1: The train and test theoretical accuracy (left), test-train overlap (middle), and correctly classified test-train overlap (right) versus the number of allocated qubits for the MNIST_784 OpenML dataset with PCA dimensionality reduction. Each line shows the mean of the corresponding metric with $95\%$ confidence interval error bands. The gray dotted line is the mean $Q_\text{dataset}(0.99)$ (minus the class qubits). In the rightmost plot, the dip in the testing accuracy between $10$ and $20$ qubits is caused by shifts in the distribution of bits allocated to the individual features. Between $20$ and $50$ qubits, test-train overlap is near but not quite $0$, so the small number of overlapping samples leads to greater shifts in the overall proportion of correctly classified samples.
  • Figure 2: Violin plots of $Q_\text{dataset}(1.0)$ for OpenML benchmark datasets across dimensionality reduction schemes. Each violin depicts the probability density of $Q_\text{dataset}(1.0)$, estimated from the mean $Q_\text{dataset}(1.0)$ of each benchmark dataset.
  • Figure 3: The pairwise relative differences between $Q_\text{dataset}(0.99)$ and $Q_\text{dataset}(1.0)$, $\frac{Q_\text{dataset}(1.0)-Q_\text{dataset}(0.99)}{Q_\text{dataset}(0.99)}$, for each benchmark dataset.
  • Figure 4: A heatmap visualizing how the rounded mean $Q_\text{dataset}(1.0)$ of subsamples of the Tahoe dataset scales with the number of features (x-axis) and number of samples (y-axis).
  • Figure 5: (left) The number of unique samples as a function of the number of bits used to encode the input. The inset shows the same curve with a log scale on the y-axis and compares it with an exponential. (right) Evolution of test accuracy for the first 4 digits of MNIST. $N_q$ is the number of qubits. 2 qubits are reserved for reading out the class label, and $N_q-2$ qubits are available to load the data sample at the input.
  • ...and 5 more figures