Modeling Quantum Machine Learning for Genomic Data Analysis
Navneet Singh, Shiva Raj Pokhrel
TL;DR
The paper investigates whether quantum machine learning (QML) can be applied to binary classification of genomic sequences. It evaluates four QML models (QSVC, Pegasos-QSVC, VQC, QNN) under three feature maps (ZFeatureMap, ZZFeatureMap, PauliFeatureMap) using an open-source Qiskit-based workflow. The genomic data are reduced to four principal components with PCA and then encoded with each feature map. Classifier performance depends strongly on both the feature map and the algorithm: Pegasos-QSVC achieves high recall, while QNN delivers the best training accuracy but carries a risk of overfitting. The convergence analyses cover the convex dual structure of QSVM and the favorable stochastic optimization of Pegasos; for VQC and QNN they raise considerations of expressiveness, gradient-based trainability, and barren plateaus in the quantum parameter landscape. Overall, the work demonstrates the potential of QML for genomic data classification on NISQ-like devices, emphasizes the critical role of feature-map design, and outlines directions for improving robustness, generalization, and noise resilience in practical genomics applications.
Abstract
Quantum Machine Learning (QML) continues to evolve, unlocking new opportunities for diverse applications. In this study, we investigate and evaluate the applicability of QML models for binary classification of genome sequence data by employing various feature mapping techniques. We present an open-source, independent Qiskit-based implementation to conduct experiments on a benchmark genomic dataset. Our simulations reveal that the interplay between feature mapping techniques and QML algorithms significantly influences performance. Notably, the Pegasos Quantum Support Vector Classifier (Pegasos-QSVC) exhibits high sensitivity (recall), while Quantum Neural Networks (QNN) achieve the highest training accuracy across all feature maps. However, classifier performance varies markedly with the feature map, which highlights the risk of overfitting to localized output distributions in certain scenarios. This work underscores the potential of QML for genomic data classification while emphasizing the need for continued advancements to enhance the robustness and accuracy of these methodologies.
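To make the interplay between feature map and algorithm concrete, the following is a minimal classical sketch, not the implementation used in the study. It assumes a Z-type feature map with a single repetition (a Hadamard followed by a phase gate P(2x_i) on each qubit), for which the encoded state is a product state and the fidelity kernel reduces to a product of cos² terms. It then trains a kernelized Pegasos classifier on that kernel. The function names, the toy four-feature data, and the hyperparameters `lam` and `steps` are illustrative assumptions; the feature-map repetitions and settings used in the experiments may differ.

```python
import math
import random


def z_feature_map_kernel(x, y):
    """Fidelity kernel |<phi(x)|phi(y)>|^2 for a one-repetition Z feature map.

    Assumption: each qubit i is prepared with H followed by P(2 * x_i), so the
    state is a product state and the overlap factorises into cos^2(x_i - y_i).
    """
    k = 1.0
    for xi, yi in zip(x, y):
        k *= math.cos(xi - yi) ** 2
    return k


def pegasos_kernel_train(X, y, kernel, lam=0.1, steps=200, seed=0):
    """Kernelised Pegasos: alpha[j] counts how often sample j violated the margin."""
    rng = random.Random(seed)
    n = len(X)
    # Precompute the Gram matrix once; on quantum hardware each entry would be
    # estimated from circuit measurements instead of a closed form.
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n
    for t in range(1, steps + 1):
        i = rng.randrange(n)  # pick one training sample uniformly at random
        score = sum(alpha[j] * y[j] * K[i][j] for j in range(n))
        if y[i] * score / (lam * t) < 1:  # margin violated: add a support count
            alpha[i] += 1
    return alpha


def pegasos_decision(x, X, y, alpha, kernel):
    """Unnormalised decision value; its sign is the predicted label."""
    return sum(a * yj * kernel(x, xj) for a, yj, xj in zip(alpha, y, X))


# Toy stand-in for four PCA components per sample (hypothetical values).
X_train = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
    [1.5, 1.5, 1.5, 1.5],
    [1.4, 1.5, 1.6, 1.5],
]
y_train = [1, 1, -1, -1]

alpha = pegasos_kernel_train(X_train, y_train, z_feature_map_kernel)
value = pegasos_decision([0.05, 0.0, 0.0, 0.0], X_train, y_train, alpha,
                         z_feature_map_kernel)
print("predicted label:", 1 if value > 0 else -1)
```

Swapping `z_feature_map_kernel` for a kernel derived from an entangling feature map changes the Gram matrix while leaving the optimizer untouched, which is one way to see why the same algorithm can behave very differently across feature maps.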
