QMill: Representative Quantum Data Generation for Quantum Machine Learning Utility
Jason Ludmir, Ian Martin, Nicholas S. DiBrita, Daniel Leeds, Tirthak Patel
TL;DR
QMill addresses the need for representative quantum data in QML by generating low-depth, entangled samples whose concentratable entanglement (CE) values follow user-defined distributions. It combines a library of lightweight ansatzes, dual-annealing optimization that matches the target CE distribution by minimizing total variation distance (TVD), and SWAP-test-based diversity checks, producing scalable, entanglement-aware datasets. The framework is validated on both classical datasets mapped to quantum amplitudes and native quantum datasets, showing faithful CE distribution replication and resilience to noise; a three-qubit quantum neural network (QNN) trained on QMill data attains performance near a classical baseline. Open-source code and datasets are provided, offering a practical tool for benchmarking and advancing QML under realistic quantum-data conditions.
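The TVD-based distribution matching mentioned above can be illustrated with a minimal sketch. It assumes scalar CE values binned into equal-width histograms over [0, 1]; the function names, bin count, and range are hypothetical choices for illustration and are not taken from the QMill code. In such a setup the distance would presumably serve as the objective that the optimizer minimizes; the optimizer itself is not shown.

```python
def histogram(values, num_bins=10, lo=0.0, hi=1.0):
    """Bin values in [lo, hi] into equal-width bins and normalize to probabilities.

    Assumption: the binning scheme is illustrative, not the paper's.
    """
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def total_variation_distance(p, q):
    """TVD between two discrete distributions defined over the same bins."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical CE samples: a target set and a generated set.
target_ce = [0.1, 0.1, 0.9, 0.9]
generated_ce = [0.1, 0.2, 0.8, 0.9]

tvd = total_variation_distance(
    histogram(target_ce, num_bins=2),
    histogram(generated_ce, num_bins=2),
)
```

A TVD of 0 means the two binned distributions coincide, and a TVD of 1 means they share no bins, so a lower value indicates a closer match to the target.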
Abstract
Quantum machine learning (QML) promises significant speedups, particularly when operating on quantum datasets. However, its progress is hindered by the scarcity of suitable training data. Existing synthetic data generation methods fall short in capturing essential entanglement properties, limiting their utility for QML. To address this, we introduce QMill, a low-depth quantum data generation framework that produces entangled, high-quality samples emulating diverse classical and quantum distributions, enabling more effective development and evaluation of QML models in representative-data settings.
