Constructing Decision Trees from Data Streams

Huy Pham, Hoang Ta, Hoa T. Vu

TL;DR

The paper develops streaming and massively parallel algorithms to construct optimal splits for decision trees from data streams without assuming i.i.d. data, addressing both regression and classification with numerical and categorical observations. It provides exact and approximation algorithms that operate in sublinear space and a small number of passes, with additional results enabling extension to MPC/MapReduce. Guarantees include $L(j) \le \mathrm{OPT}+\epsilon$ and $(1+\epsilon)\mathrm{OPT}$-type bounds for various losses (MSE, misclassification rate, and Gini impurity), plus lower bounds for categorical data and handling of deletions via Count-Min sketches. The work offers a unified framework for streaming decision-tree construction, applicable to massive, evolving datasets and scalable in distributed environments, bridging traditional CART-like objectives with modern streaming and parallel models.

Abstract

In this work, we present data stream algorithms to compute optimal splits for decision tree learning. In particular, given a data stream of observations \(x_i\) and their corresponding labels \(y_i\), without the i.i.d. assumption, the objective is to identify the optimal split \(j\) that partitions the data into two sets, minimizing the mean squared error (for regression) or the misclassification rate and Gini impurity (for classification). We propose several efficient streaming algorithms that require sublinear space and use a small number of passes to solve these problems. These algorithms can also be extended to the MapReduce model. Our results, while not directly comparable, complement the seminal work of Domingos-Hulten (KDD 2000) and Hulten-Spencer-Domingos (KDD 2001).
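To make the regression objective concrete, here is a minimal Python sketch of a one-pass baseline. It is not one of the paper's algorithms: it stores one counter triple per attribute value, so it uses space linear in the attribute range rather than sublinear space. It assumes observations are integers in `1..n_values`, that a split \(j\) sends \(x_i \le j\) to the left set and \(x_i > j\) to the right set, and that each side is predicted by its mean label. The function name `best_regression_split` and its parameters are hypothetical.

```python
def best_regression_split(stream, n_values):
    """Return (j, mse) for the split minimizing mean squared error.

    Assumptions (not taken from the paper): x is an integer in
    1..n_values, split j puts x <= j on the left and x > j on the
    right, and each side is predicted by the mean of its labels.
    """
    count = [0] * (n_values + 1)      # number of labels seen per value
    total = [0.0] * (n_values + 1)    # sum of labels per value
    squares = [0.0] * (n_values + 1)  # sum of squared labels per value
    m = 0
    for x, y in stream:               # single pass over the stream
        count[x] += 1
        total[x] += y
        squares[x] += y * y
        m += 1
    if m == 0:
        raise ValueError("empty stream")

    def sse(c, s, q):
        # Sum of squared errors around the mean: q - s^2 / c.
        return q - s * s / c if c > 0 else 0.0

    all_s, all_q = sum(total), sum(squares)
    best_j, best_loss = None, float("inf")
    lc, ls, lq = 0, 0.0, 0.0          # running left-side statistics
    for j in range(1, n_values + 1):
        lc += count[j]
        ls += total[j]
        lq += squares[j]
        loss = (sse(lc, ls, lq) + sse(m - lc, all_s - ls, all_q - lq)) / m
        if loss < best_loss:          # ties keep the smallest j
            best_j, best_loss = j, loss
    return best_j, best_loss
```

For example, `best_regression_split([(1, 1.0), (2, 1.0), (3, 5.0), (4, 5.0)], 4)` selects the split `j = 2`, which separates the two label groups exactly. The order of the stream does not matter, which matches the setting without an i.i.d. assumption.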

Paper Structure (26 sections, 15 theorems, 68 equations, 3 figures, 6 algorithms)

Key Result

Theorem 1

For regression, we have the following algorithms:

Figures (3)

  • Figure 1: The left figure is an example of regression. The optimal split is $j = 4$ which minimizes the mean squared error. The right figure is an example of classification. The optimal split is $j = 4$ which minimizes the misclassification rate.
  • Figure 2: Result summary. R: Regression, C: Classification, CC: classification with categorical attributes. 1: loss function based on misclassification rate, 2: loss function based on Gini impurity.
  • Figure 3: Result summary. R: Regression, C: Classification, CC: classification with categorical attributes. 1: loss function based on misclassification rate, 2: loss function based on Gini impurity.
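The two classification losses named in the captions above can be illustrated with a small Python sketch. It is a linear-space baseline under stated assumptions, not the paper's sublinear-space method: labels are binary (0 or 1), observations are integers in `1..n_values`, a split \(j\) sends \(x_i \le j\) left and \(x_i > j\) right, the misclassification rate predicts the majority label on each side, and Gini impurity is the size-weighted CART form. The paper's exact definitions may differ, and the name `best_classification_split` is hypothetical.

```python
def best_classification_split(stream, n_values, criterion="misclassification"):
    """Return (j, loss) for the best split under the chosen criterion.

    Assumptions (not taken from the paper): labels are 0 or 1, x is an
    integer in 1..n_values, and split j puts x <= j on the left.
    """
    ones = [0] * (n_values + 1)   # count of label 1 per value
    zeros = [0] * (n_values + 1)  # count of label 0 per value
    m = 0
    for x, y in stream:           # single pass over the stream
        if y == 1:
            ones[x] += 1
        else:
            zeros[x] += 1
        m += 1
    if m == 0:
        raise ValueError("empty stream")

    def side_loss(n0, n1):
        n = n0 + n1
        if n == 0:
            return 0.0
        if criterion == "misclassification":
            return min(n0, n1) / m  # errors made by the majority vote
        if criterion == "gini":
            p0, p1 = n0 / n, n1 / n
            return (n / m) * (1.0 - p0 * p0 - p1 * p1)
        raise ValueError("unknown criterion: " + criterion)

    all0, all1 = sum(zeros), sum(ones)
    best_j, best_loss = None, float("inf")
    l0 = l1 = 0                   # running left-side label counts
    for j in range(1, n_values + 1):
        l0 += zeros[j]
        l1 += ones[j]
        loss = side_loss(l0, l1) + side_loss(all0 - l0, all1 - l1)
        if loss < best_loss:      # ties keep the smallest j
            best_j, best_loss = j, loss
    return best_j, best_loss
```

On the stream `[(1, 0), (2, 0), (3, 1), (4, 1), (2, 1)]` with `n_values = 4`, the two criteria choose different splits: the misclassification rate is minimized at `j = 1` and the Gini impurity at `j = 2`, which shows why the two losses are treated separately.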

Theorems & Definitions (32)

  • Theorem 1: Main Result 1
  • Theorem 2: Main Result 2
  • Theorem 3: Main Result 3
  • proof: Proof of Theorem 1 (1)
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • proof: Proof of Theorem 1 (2)
  • Lemma 6
  • ...and 22 more