QMill: Representative Quantum Data Generation for Quantum Machine Learning Utility

Jason Ludmir, Ian Martin, Nicholas S. DiBrita, Daniel Leeds, Tirthak Patel

TL;DR

QMill addresses the shortage of representative quantum data for quantum machine learning (QML) by generating low-depth, entangled samples whose concentratable entanglement (CE) values follow user-defined distributions. It combines a library of lightweight ansatzes, dual-annealing optimization that matches the generated CE distribution to the target by total variation distance (TVD), and SWAP-test-based diversity checks, yielding scalable, entanglement-aware datasets. The framework is validated on classical datasets mapped to quantum amplitudes and on native quantum datasets, where it replicates the target CE distributions faithfully and remains resilient to noise; a three-qubit quantum neural network (QNN) trained on QMill data attains performance near a classical baseline. Open-source code and datasets are provided as a practical tool for benchmarking and advancing QML under realistic quantum-data conditions.
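To make the TVD objective concrete, here is a minimal sketch of how the distance between a target CE distribution and a generated one could be scored. It is not QMill's implementation: the function names, the equal-width binning, and the assumption that CE values fall in [0, 1] are all illustrative choices. An optimizer such as dual annealing would then tune ansatz parameters to drive this score toward zero.

```python
def histogram(values, bins, lo=0.0, hi=1.0):
    """Normalized histogram over equal-width bins on [lo, hi] (assumed range)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp the right edge so v == hi lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def total_variation_distance(p, q):
    """TVD between two discrete distributions over the same bins."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical CE samples: a target dataset and a generated one.
target_ce = [0.10, 0.10, 0.60, 0.90]
generated_ce = [0.15, 0.55, 0.65, 0.95]

p = histogram(target_ce, bins=4)
q = histogram(generated_ce, bins=4)
tvd = total_variation_distance(p, q)
```

A TVD of 0 means the two binned distributions coincide and 1 means they do not overlap at all, so the score is bounded and comparable across targets.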

Abstract

Quantum machine learning (QML) promises significant speedups, particularly when operating on quantum datasets. However, its progress is hindered by the scarcity of suitable training data. Existing synthetic data generation methods fall short in capturing essential entanglement properties, limiting their utility for QML. To address this, we introduce QMill, a low-depth quantum data generation framework that produces entangled, high-quality samples emulating diverse classical and quantum distributions, enabling more effective development and evaluation of QML models in representative-data settings.

Paper Structure

This paper contains 34 sections, 10 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: QMill takes classical product states and generates diverse and customizable quantum data for QML tasks.
  • Figure 2: QMill develops a variety of ansatz designs for real and synthetic CE distributions.
  • Figure 3: In addition to the CE distributions of real data, QMill also tests its efficacy for different CE distributions.
  • Figure 4: QMill uses the SWAP test to validate the dissimilarity of any two random samples with similar CE values.
  • Figure 5: Showcase of top-performing circuits training to mimic the CE of various arbitrary, stress-testing, and real-dataset distributions.
  • ...and 4 more figures
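
The SWAP test mentioned in Figure 4 estimates how similar two states are: the ancilla is measured as 0 with probability 1/2 + |⟨ψ|φ⟩|²/2, so identical states give 1 and orthogonal states give 1/2. The sketch below computes that probability directly from state vectors rather than simulating the circuit; the function names are hypothetical and this is not taken from the QMill code.

```python
import math


def overlap_squared(psi, phi):
    """Squared magnitude of the inner product <psi|phi> for normalized states."""
    inner = sum(complex(a).conjugate() * complex(b) for a, b in zip(psi, phi))
    return abs(inner) ** 2


def swap_test_p0(psi, phi):
    """Ideal probability of measuring the SWAP-test ancilla in state 0."""
    return 0.5 + 0.5 * overlap_squared(psi, phi)


# Single-qubit examples: |0>, |1>, and |+> = (|0> + |1>) / sqrt(2).
zero = [1.0, 0.0]
one = [0.0, 1.0]
plus = [1 / math.sqrt(2), 1 / math.sqrt(2)]

p_same = swap_test_p0(zero, zero)        # identical states
p_orthogonal = swap_test_p0(zero, one)   # fully distinguishable states
p_partial = swap_test_p0(zero, plus)     # partial overlap
```

Under this reading, two samples with similar CE values are dissimilar when the estimated ancilla-0 probability sits well below 1.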