A Quantum Approach to Synthetic Minority Oversampling Technique (SMOTE)
Nishikanta Mohanty, Bikash K. Behera, Christopher Ferrie, Pravat Dash
TL;DR
This paper introduces Quantum-SMOTE, a quantum-enhanced method for addressing class imbalance that generates synthetic minority samples through swap tests and quantum rotations, removing the reliance on neighbor-based interpolation. A compact swap-test circuit computes angular distances between cluster centroids and minority points, and small-angle rotations then produce the synthetic data. Three hyperparameters control the process: rotation angle, minority percentage, and splitting factor. The method is demonstrated on a Telecom Churn dataset using Random Forest and Logistic Regression, showing improved precision-recall and ROC performance as the share of synthetic minority data increases (e.g., higher PR and AUC with 40–50% synthetic data). The study highlights low-depth quantum circuits and reduced qubit requirements as advantages for practical quantum-enhanced data augmentation, indicating potential for scalable, quantum-assisted handling of imbalanced datasets. The results suggest Quantum-SMOTE can augment modern ML pipelines, particularly where high-dimensional data and class imbalance intersect, while maintaining fidelity to the original distribution through controlled rotations and clustering.
Abstract
The paper proposes Quantum-SMOTE, a novel method that uses quantum computing techniques to address the prevalent problem of class imbalance in machine learning datasets. Inspired by the Synthetic Minority Oversampling Technique (SMOTE), Quantum-SMOTE generates synthetic data points using quantum processes such as swap tests and quantum rotation. It departs from the conventional SMOTE algorithm's use of K-Nearest Neighbors (KNN) and Euclidean distances, so synthetic instances can be generated from minority class data points without relying on neighbor proximity. The algorithm offers greater control over synthetic data generation by introducing hyperparameters such as rotation angle, minority percentage, and splitting factor, which allow it to be tailored to specific dataset requirements. Because it uses a compact swap test, the algorithm can accommodate a large number of features. The approach is tested on a public Telecom Churn dataset and evaluated with two prominent classification algorithms, Random Forest and Logistic Regression, to determine its impact at varying proportions of synthetic data.
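
To make the swap-test and rotation idea concrete, the following is a minimal classical sketch in NumPy, not the paper's quantum circuit. It relies on the standard result that an ideal swap test on states |a> and |b> measures the ancilla in |0> with probability 1/2 + 1/2 |<a|b>|^2, from which an angle between the two states can be recovered. The function names (`swap_test_p0`, `angular_distance`, `rotate_in_plane`, `synthesize`), the choice to rotate within the plane spanned by a minority point and its centroid, and the uniform sampling of the angle are all assumptions made for illustration; the paper's exact rotation scheme and its minority-percentage and splitting-factor logic are not reproduced here.

```python
import numpy as np


def normalize(v):
    """Return v scaled to unit length (amplitude-encoding style)."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    if n == 0:
        raise ValueError("cannot normalize the zero vector")
    return v / n


def swap_test_p0(a, b):
    """Ideal probability of reading the ancilla as 0 in a swap test.

    Classical stand-in for the circuit: P(0) = 1/2 + 1/2 * |<a|b>|^2.
    """
    overlap = np.dot(normalize(a), normalize(b))
    return 0.5 + 0.5 * overlap**2


def angular_distance(a, b):
    """Angle recovered from the swap-test probability (assumed definition)."""
    fidelity = np.clip(2.0 * swap_test_p0(a, b) - 1.0, 0.0, 1.0)
    return float(np.arccos(np.sqrt(fidelity)))


def rotate_in_plane(point, centroid, angle):
    """Rotate the unit vector of `point` by `angle` toward `centroid`.

    The rotation stays in the plane spanned by the two vectors
    (an illustrative choice, not necessarily the paper's).
    """
    p = normalize(point)
    c = normalize(centroid)
    residual = c - np.dot(c, p) * p  # part of c orthogonal to p
    norm = np.linalg.norm(residual)
    if norm < 1e-12:
        return p  # point and centroid are parallel; nothing to rotate toward
    q = residual / norm
    return np.cos(angle) * p + np.sin(angle) * q


def synthesize(minority, centroid, max_angle, n_per_point, rng):
    """Create n_per_point synthetic unit vectors per minority point."""
    out = []
    for x in minority:
        for _ in range(n_per_point):
            angle = rng.uniform(0.0, max_angle)  # small-angle rotation
            out.append(rotate_in_plane(x, centroid, angle))
    return np.array(out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    minority = np.array([[1.0, 0.2, 0.1], [0.9, 0.3, 0.0], [1.0, 0.0, 0.4]])
    centroid = minority.mean(axis=0)
    synthetic = synthesize(minority, centroid, max_angle=0.1, n_per_point=2, rng=rng)
    print(synthetic.shape)
```

In this sketch `max_angle` plays the role of the rotation-angle hyperparameter: a smaller value keeps each synthetic vector closer to the minority point it was generated from. Because the vectors are normalized, the outputs live on the unit sphere and would need to be rescaled before being mixed back into un-normalized training data.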
