Constructing Decision Trees from Data Streams
Huy Pham, Hoang Ta, Hoa T. Vu
TL;DR
The paper develops streaming and massively parallel algorithms that compute optimal splits for decision trees from data streams without assuming i.i.d. data, covering both regression and classification with numerical and categorical observations. It gives exact and approximation algorithms that use sublinear space and a small number of passes, and the results extend to the MPC/MapReduce model. Guarantees take the form $L(j) \le \mathrm{OPT}+\epsilon$ or are $(1+\epsilon)\mathrm{OPT}$-type bounds for the losses considered (MSE, misclassification rate, and Gini impurity), where $L(j)$ is the loss of the returned split $j$ and $\mathrm{OPT}$ is the minimum loss over all splits. The paper also proves lower bounds for categorical data and handles deletions via Count-Min sketches. Together these results form a unified framework for streaming decision-tree construction that applies to massive, evolving datasets, scales to distributed environments, and connects CART-like objectives with streaming and parallel models.
Abstract
In this work, we present data stream algorithms to compute optimal splits for decision tree learning. In particular, given a data stream of observations \(x_i\) and their corresponding labels \(y_i\), without the i.i.d. assumption, the objective is to identify the optimal split \(j\) that partitions the data into two sets, minimizing the mean squared error (for regression) or the misclassification rate and Gini impurity (for classification). We propose several efficient streaming algorithms that require sublinear space and use a small number of passes to solve these problems. These algorithms can also be extended to the MapReduce model. Our results, while not directly comparable, complement the seminal work of Domingos-Hulten (KDD 2000) and Hulten-Spencer-Domingos (KDD 2001).
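
To make the regression objective concrete, the following is a minimal offline sketch of the quantity being optimized. It is not one of the paper's streaming algorithms: it stores and sorts all of the data, which is exactly what a sublinear-space method must avoid. The function name `best_split_mse`, the convention that a split \(j\) sends \(x_i \le j\) to the left and \(x_i > j\) to the right, and the normalization of the loss by the number of observations are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def best_split_mse(xs: List[int], ys: List[float]) -> Tuple[Optional[int], float]:
    """Naive exact baseline (hypothetical helper, not the paper's method).

    Returns (j, loss), where the split j sends x_i <= j to the left and
    x_i > j to the right, and loss is the total squared error of each side
    about its own mean, divided by the number of observations.
    """
    m = len(xs)
    pairs = sorted(zip(xs, ys))
    total_sum = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)

    best_j: Optional[int] = None
    best_loss = float("inf")
    left_n, left_sum, left_sq = 0, 0.0, 0.0

    i = 0
    while i < m:
        x = pairs[i][0]
        # Move every observation with this x value to the left side.
        while i < m and pairs[i][0] == x:
            y = pairs[i][1]
            left_n += 1
            left_sum += y
            left_sq += y * y
            i += 1

        right_n = m - left_n
        # Sum of squared deviations from the mean: sum(y^2) - (sum y)^2 / n.
        sse_left = left_sq - left_sum ** 2 / left_n
        if right_n > 0:
            right_sum = total_sum - left_sum
            sse_right = (total_sq - left_sq) - right_sum ** 2 / right_n
        else:
            sse_right = 0.0

        loss = (sse_left + sse_right) / m
        if loss < best_loss:
            best_j, best_loss = x, loss

    return best_j, best_loss


if __name__ == "__main__":
    j, loss = best_split_mse([1, 2, 3, 4], [0.0, 0.0, 10.0, 10.0])
    print(j, loss)
```

This baseline uses space linear in the number of observations and needs random access to the sorted data. The algorithms described in the abstract target the same objective while reading the stream in a small number of passes with sublinear memory.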
