Better than classical? The subtle art of benchmarking quantum machine learning models

Joseph Bowles, Shahnawaz Ahmed, Maria Schuld

TL;DR

The paper presents a large-scale, open-source benchmarking study of 12 quantum machine learning models on 6 binary classification tasks, implemented in PennyLane and compared against classical baselines. It emphasizes methodological rigor to avoid benchmarking biases and finds that, on small-scale tasks, classical models typically outperform quantum ones and that entanglement is not universally beneficial. The work highlights how sensitive results are to data design and hyperparameters, and it raises key questions about the true sources of any observed quantum advantage. Overall, it advocates benchmarking that goes beyond simple leaderboards in order to guide future quantum model design and data selection.

Abstract

Benchmarking models via classical simulations is one of the main ways to judge ideas in quantum machine learning before noise-free hardware is available. However, the huge impact of the experimental design on the results, the small scales within reach today, as well as narratives influenced by the commercialisation of quantum technologies make it difficult to gain robust insights. To facilitate better decision-making we develop an open-source package based on the PennyLane software framework and use it to conduct a large-scale study that systematically tests 12 popular quantum machine learning models on 6 binary classification tasks used to create 160 individual datasets. We find that overall, out-of-the-box classical machine learning models outperform the quantum classifiers. Moreover, removing entanglement from a quantum model often results in as good or better performance, suggesting that "quantumness" may not be the crucial ingredient for the small learning tasks considered here. Our benchmarks also unlock investigations beyond simplistic leaderboard comparisons, and we identify five important questions for quantum model design that follow from our results.

Paper Structure

This paper contains 55 sections, 48 equations, 24 figures, 1 table.

Figures (24)

  • Figure 1: The scope of the benchmark study at a glance.
  • Figure 2: Illustrative example showing the effect of slight variations in a dataset on model performance. The same quantum model is trained on two different datasets to predict two classes, red and blue. The decision regions are displayed as the shaded areas. Depending on a small variation of the classification task, the same model can perform poorly (left) or achieve a perfect test score (right). We used a "vanilla" quantum neural network model with two layers of angle embedding interspersed with CNOT entanglers, and three layers of a trainable variational circuit, followed by a $Z$-measurement on the first qubit. The classifier is trained on the points with round markers and tested on the points marked as triangles.
  • Figure 3: Numerical illustration of the thought experiment on a positivity bias. Assume that the "true" performance of classical and quantum models is distributed normally (blue and red curves) with the mean for classical model performance higher than in the quantum case. The dashed lines report numerical calculations of the mean of model performance if $100$ researchers report only on the top-performing candidate out of $20$ quantum models, but do not select the best classical model in a similar manner. The bias from discarding the $19$ worst-performing quantum models reverses the observed average performance with respect to the true one.
  • Figure 4: Publishing date versus citations of the eleven selected papers (red crosses) from an initial set of 3500 papers drawn from the arXiv API (gray dots). Outliers with over $1500$ citations are not shown. Selecting papers with $30$ or more citations introduces a bias towards less recent work.
  • Figure 5: Illustrative examples of datasets created by the different data generation procedures. For the scatter plots, the two classes are shown in blue and orange, with training points drawn as round markers and test points as 'x' markers. The linearly separable panel shows data for the linearly separable benchmark in 2 and 3 dimensions. The left two plots for the MNIST data correspond to 2d and 3d mnist pca data, and the rightmost image shows examples from the mnist-cg dataset for 32 x 32, 16 x 16, 8 x 8 and 4 x 4 pixel grids. The hidden manifold examples correspond to a $1$d (left), $2$d (center) and $3$d (right) manifold embedded into $3$ dimensions. The bars and stripes panel shows examples from the bars & stripes dataset for a 16 x 16 pixel grid. The examples from the two curves diff benchmark show a degree of $2, 10, 20$ for the Fourier series, embedding the curves into $10$ dimensions (of which three are plotted). The hyperplanes panel shows data from the hyperplanes diff benchmark, where there are two (left) and five (right) hyperplanes used to decide the class labels.
  • ...and 19 more figures
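
The positivity-bias thought experiment described in the caption of Figure 3 can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' code: the function name `simulate_positivity_bias`, the means, the standard deviation and the seed are all illustrative assumptions. Only the counts of $100$ researchers and $20$ quantum models per researcher are taken from the caption.

```python
import random
import statistics


def simulate_positivity_bias(n_researchers=100, n_quantum=20,
                             classical_mean=0.80, quantum_mean=0.75,
                             std=0.05, seed=0):
    """Return (average reported classical score, average reported quantum score).

    The means and std are hypothetical values chosen for illustration;
    scores are in arbitrary units and are not clipped to [0, 1].
    """
    rng = random.Random(seed)
    classical_reported = []
    quantum_reported = []
    for _ in range(n_researchers):
        # One classical model is evaluated and reported without selection.
        classical_reported.append(rng.gauss(classical_mean, std))
        # Twenty quantum models are evaluated, but only the best is reported.
        quantum_scores = [rng.gauss(quantum_mean, std) for _ in range(n_quantum)]
        quantum_reported.append(max(quantum_scores))
    return statistics.mean(classical_reported), statistics.mean(quantum_reported)


classical_avg, quantum_avg = simulate_positivity_bias()
print(f"average reported classical score: {classical_avg:.3f}")
print(f"average reported quantum score:   {quantum_avg:.3f}")
```

Under these assumed parameters the true quantum mean is lower than the classical one, yet the average of the reported quantum scores comes out higher, because taking the maximum of $20$ draws shifts the reported value well above the underlying mean. Selecting the best classical model in the same way would remove the asymmetry.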