QC-Forest: a Classical-Quantum Algorithm to Provably Speedup Retraining of Random Forest

Romina Yalovetzky, Niraj Kumar, Changhao Li, Marco Pistoia

TL;DR

QC-Forest presents a hybrid classical-quantum approach to constructing and time-efficiently retraining random forests for streaming data. By building on Des-q and extending it to multi-class tasks, it uses supervised q-means with weighted distances and exact classical feature-weight updates to achieve poly-logarithmic retraining time in the total sample size $N$, while preserving predictive accuracy comparable to state-of-the-art RF methods. A key contribution is efficiently estimating leaf-class probabilities and the multi-class $\eta$ coefficient, which enable multi-class inference and threshold tuning, respectively. The work demonstrates competitive performance on benchmarks with up to $N \approx 8\times 10^4$ samples and shows substantial speedups for incremental retraining in data-stream scenarios, highlighting practical potential as quantum hardware matures.

Abstract

Random Forest (RF) is a popular tree-ensemble method for supervised learning, prized for its ease of use and flexibility. Online RF models must account for new training data to maintain model accuracy. This is particularly important in applications where data is generated periodically and sequentially in data streams, such as autonomous driving systems and credit card payments. In this setting, periodically retraining the model on the accumulated old and new data is beneficial, as it fully captures possible drifts in the data distribution over time. However, this is impractical with state-of-the-art classical algorithms for RF, as they scale linearly with the accumulated number of samples. We propose QC-Forest, a classical-quantum algorithm designed to time-efficiently retrain RF models in the streaming setting for multi-class classification and regression, achieving a runtime poly-logarithmic in the total number of accumulated samples. QC-Forest builds on Des-q, a quantum algorithm for single-tree construction and retraining proposed by Kumar et al., in two ways: it extends Des-q to multi-class classification, as the original proposal was limited to binary classes, and it introduces an exact classical method to replace an underlying quantum subroutine that incurs a finite error, while maintaining the same poly-logarithmic dependence. Finally, we show that QC-Forest achieves competitive accuracy compared with state-of-the-art RF methods on widely used benchmark datasets with up to 80,000 samples, while significantly speeding up model retraining.
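To make the idea of an exact classical feature-weight update concrete, the following is a minimal sketch of how a Pearson correlation between one feature and a numeric label could be kept up to date as new batches arrive. It assumes the weight is derived from the standard Pearson coefficient and that running sums are stored between retraining rounds; the class name `RunningPearson` and its methods are hypothetical and are not taken from the paper. The point of the sketch is only that each update touches the new samples and never revisits the accumulated ones.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningPearson:
    """Running sums for one feature/label pair (hypothetical helper)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    syy: float = 0.0
    sxy: float = 0.0

    def update(self, xs, ys):
        """Fold a new batch into the sums; cost is linear in the batch size only."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.syy += y * y
            self.sxy += x * y

    def correlation(self):
        """Exact Pearson coefficient of all samples seen so far."""
        cov = self.n * self.sxy - self.sx * self.sy
        var_x = self.n * self.sxx - self.sx ** 2
        var_y = self.n * self.syy - self.sy ** 2
        if var_x <= 0 or var_y <= 0:
            return 0.0  # constant feature or label: correlation undefined, report 0
        return cov / math.sqrt(var_x * var_y)


# Stream the data in two batches, then compare with a single pass.
streamed = RunningPearson()
streamed.update([1, 2, 3], [2, 1, 4])
streamed.update([4, 5], [3, 5])

single_pass = RunningPearson()
single_pass.update([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Raw sums of squares can lose precision on large or poorly scaled data, so a production version would likely use a numerically stabler update; that choice is outside what the abstract specifies.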

Paper Structure

This paper contains 22 sections, 12 theorems, 41 equations, 6 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Given the new training data $D_{N_{\text{new}}}=\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N_{\text{new}}}$ where each $(\boldsymbol{x}_i, y_i) \in \mathbb{R}^{d} \times \mathcal{M}$, with $\mathcal{M} = \{l_1, \dots, l_c\}$, such that $N_{\text{new}}\ll N$, and the output of QC-Forest-BUILD (Algorithm alg: where, $T_{\text{load-new}}$ is the time to load the new samples in the KP-tree, $T_{\text{weights-

Figures (6)

  • Figure 1: Diagram of QC-Forest-BUILD. For each $n$-th tree, Step 1 consists of sampling from the training data to get $\mathcal{S}_{n}$ and calculating both the feature weights and the weighted centroids ($\{\boldsymbol{c}_{lw}\}_{l}$) on the classical computer. In Step 2 the data is stored, and both $\mathcal{S}_{n}$ and $\{\boldsymbol{c}_{lw}\}_{l}$ are loaded into a quantum-accessible data structure (blue disk). The samples $\mathcal{S}_{n}$ and its corresponding feature weights are stored in the classical data structure (black disk) as they will be used when doing retraining with this data and a new batch of data. In Step 3, the supervised $q$-means is performed, where in each iteration some new centroids are output and utilized to create the weighted centroids to be used in the next iteration. Once it converges, the weighted centroids (or the centroids themselves) are stored in classical memory. This is repeated until reaching the maximum depth $D$ in Step 4. In Step 5, the leaf label extraction is performed. The probability of each class ($\{P_{i}\}_{i}$) for classification (or the mean value for regression) is estimated.
  • Figure 2: ROC AUC as a function of the number of samples used to do incremental retraining. (a) corresponds to a test set of size 250 samples, corresponding to a fraction of $0.25$ of the size of the first batch used and (b) corresponds to a test size of $4787$ (total samples used in retraining are 5300). Orange line corresponds to the median value obtained with QC-Forest-c and the blue line to the one obtained with the baseline. These are the median over the five different sampling experiments and the shaded area corresponds to the standard deviation obtained.
  • Figure 3: Entropy of a single tree at each depth as a function of the depth. The entropy at each depth corresponds to the sum of the entropy of the labels at each node weighted by the fraction of training samples at each node. We compare QC-Forest-c to the baseline. "Unsupervised" refers to QC-Forest-c that performs unsupervised clustering that implements Euclidean distance ($w_j = 1$, $\forall j$, where $w_j$ is the weight of the $j$-th feature). This corresponds to one tree over the ensemble trained over the five folds. The lines correspond to the median values and the bars to the standard deviation across the folds.
  • Figure 4: Performance in test. ROC AUC as a function of the number of clusters for different tree depths (D) for an ensemble of 100 trees constructed with the PIMA dataset.
  • Figure 5: Accuracy in test as a function of the threshold used to assign a label given the probabilities of each of the two classes. The vertical black line corresponds to the majority-vote threshold $t=0.5$ and the vertical green line corresponds to the "best" threshold, for which the accuracy is maximum. The dataset corresponds to one of the five folds created for the "German" dataset and the model is an ensemble of 100 trees with $D=3$ and $k=3$.
  • ...and 1 more figure
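The per-depth entropy described in the Figure 3 caption (node label entropies weighted by the fraction of training samples at each node) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function names are hypothetical, each node is represented simply as the list of labels that reached it, and the logarithm base (2 here) is an assumption since the caption does not state it.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (base 2, an assumption) of the labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def depth_entropy(nodes):
    """Entropy at one depth: node entropies weighted by each node's sample share."""
    total = sum(len(labels) for labels in nodes)
    if total == 0:
        return 0.0
    return sum(len(labels) / total * label_entropy(labels) for labels in nodes)


# Two nodes at the same depth: one evenly mixed, one pure.
mixed_and_pure = depth_entropy([[0, 0, 1, 1], [1, 1, 1, 1]])
```

A lower value at a given depth means the nodes at that depth separate the classes more cleanly, which is the quantity the figure compares across methods.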

Theorems & Definitions (19)

  • Definition 1
  • Theorem 1: Time complexity of QC-Forest-RETRAIN
  • proof
  • Theorem 2: Time complexity to classically update Pearson correlation
  • proof
  • Theorem 3: Time complexity to classically update $\eta$ coefficient
  • proof
  • Theorem 4
  • Theorem 5: Time complexity of leaf label assignment for classification
  • proof
  • ...and 9 more
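As a rough illustration of the classical $\eta$ update named in Theorem 3, the sketch below assumes that $\eta$ is the standard correlation ratio between a numeric feature and a categorical label, $\eta^2 = \mathrm{SS}_{\text{between}} / \mathrm{SS}_{\text{total}}$, and that per-class counts and sums are stored between retraining rounds. The class `RunningEta` is hypothetical; the theorem statement and proof are not reproduced in this section, so the authors' actual procedure may differ.

```python
import math
from collections import defaultdict


class RunningEta:
    """Per-class running sums for one feature (hypothetical helper)."""

    def __init__(self):
        self.count = defaultdict(int)    # samples seen per class
        self.total = defaultdict(float)  # sum of feature values per class
        self.sum_sq = 0.0                # sum of squared feature values, all classes

    def update(self, xs, labels):
        """Fold a new batch in; cost is linear in the batch size only."""
        for x, label in zip(xs, labels):
            self.count[label] += 1
            self.total[label] += x
            self.sum_sq += x * x

    def eta(self):
        """Correlation ratio over all samples seen so far; cost is linear in the class count."""
        n = sum(self.count.values())
        if n == 0:
            return 0.0
        grand = sum(self.total.values())
        correction = grand * grand / n
        total_ss = self.sum_sq - correction
        if total_ss <= 0:
            return 0.0  # constant feature: ratio undefined, report 0
        between_ss = sum(
            self.total[c] ** 2 / self.count[c] for c in self.count
        ) - correction
        # Clamp to [0, 1] to absorb floating-point rounding.
        return math.sqrt(min(1.0, max(0.0, between_ss / total_ss)))


# Feature values that separate two classes perfectly, streamed in two batches.
separated = RunningEta()
separated.update([1.0, 1.0], ["a", "a"])
separated.update([5.0, 5.0], ["b", "b"])

# Feature values with identical class means carry no class information.
uninformative = RunningEta()
uninformative.update([1.0, 3.0, 1.0, 3.0], ["a", "a", "b", "b"])
```

Under these assumptions, folding in a new batch costs time proportional to $N_{\text{new}}$, and recomputing $\eta$ costs time proportional to the number of classes $c$, independent of the accumulated sample count $N$.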