Constructing Decision Trees from Data Streams

Huy Pham; Hoang Ta; Hoa T. Vu

TL;DR

The paper develops streaming and massively parallel algorithms to construct optimal splits for decision trees from data streams without assuming i.i.d. data, addressing both regression and classification with numerical and categorical observations. It provides exact and approximation algorithms that operate in sublinear space and a small number of passes, with additional results enabling extension to MPC/MapReduce. Guarantees include $L(j) \le \mathrm{OPT}+\epsilon$ and $(1+\epsilon)\mathrm{OPT}$-type bounds for various losses (MSE, misclassification rate, and Gini impurity), plus lower bounds for categorical data and handling of deletions via Count-Min sketches. The work offers a unified framework for streaming decision-tree construction, applicable to massive, evolving datasets and scalable in distributed environments, bridging traditional CART-like objectives with modern streaming and parallel models.
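The deletion handling mentioned above relies on Count-Min sketches. As a rough illustration of why such a sketch tolerates deletions, here is a minimal generic Count-Min sketch in Python. It is a textbook version, not the paper's construction: the class name `CountMinSketch`, the `width`/`depth` parameters, and the hash family are all assumptions made for this example, and how the paper plugs the sketch into split computation is not described in this summary.

```python
import random


class CountMinSketch:
    """Generic Count-Min sketch over integer items (illustrative only)."""

    def __init__(self, width: int, depth: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.width = width
        self.prime = (1 << 61) - 1  # large prime modulus for the hash family
        # One (a, b) pair per row; h(x) = ((a*x + b) mod prime) mod width.
        self.hashes = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item: int):
        return [((a * item + b) % self.prime) % self.width for a, b in self.hashes]

    def update(self, item: int, delta: int = 1) -> None:
        """Add delta to the item's count; a negative delta models a deletion."""
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += delta

    def estimate(self, item: int) -> int:
        """Never underestimates, provided all true counts stay non-negative."""
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))


sketch = CountMinSketch(width=64, depth=4)
for _ in range(5):
    sketch.update(7)          # five insertions of item 7
sketch.update(7, delta=-2)    # two deletions of item 7
sketch.update(9, delta=2)     # unrelated item
count_of_7 = sketch.estimate(7)
```

Because every update is a plain addition to a few counters, a deletion is simply an update with a negative `delta`; the estimate stays an upper bound on the true count as long as no item's count goes below zero.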

Abstract

In this work, we present data stream algorithms to compute optimal splits for decision tree learning. In particular, given a data stream of observations $x_i$ and their corresponding labels $y_i$, without the i.i.d. assumption, the objective is to identify the optimal split $j$ that partitions the data into two sets, minimizing the mean squared error (for regression) or the misclassification rate and Gini impurity (for classification). We propose several efficient streaming algorithms that require sublinear space and use a small number of passes to solve these problems. These algorithms can also be extended to the MapReduce model. Our results, while not directly comparable, complement the seminal work of Domingos-Hulten (KDD 2000) and Hulten-Spencer-Domingos (KDD 2001).
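To make the regression objective concrete, the following Python sketch computes the optimal split offline, with all data in memory. It is a baseline for illustration, not one of the paper's sublinear-space streaming algorithms. It assumes that a split $j$ sends observations with $x_i \le j$ to the left set and the rest to the right set, and that the loss is the total squared deviation from each side's mean divided by the number of observations; the function name `best_regression_split` is hypothetical.

```python
from math import inf
from typing import Optional, Sequence, Tuple


def best_regression_split(
    xs: Sequence[int], ys: Sequence[float]
) -> Tuple[Optional[int], float]:
    """Return (j, loss) minimizing MSE, where x <= j goes left and x > j goes right."""
    pairs = sorted(zip(xs, ys))
    m = len(pairs)
    total_sum = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)

    best_j, best_loss = None, inf
    left_n, left_sum, left_sq = 0, 0.0, 0.0
    i = 0
    while i < m:
        j = pairs[i][0]
        # Move every observation with value exactly j to the left side.
        while i < m and pairs[i][0] == j:
            y = pairs[i][1]
            left_n += 1
            left_sum += y
            left_sq += y * y
            i += 1
        right_n = m - left_n
        # Sum of squared deviations from the mean: sum(y^2) - (sum y)^2 / n.
        sse_left = left_sq - left_sum ** 2 / left_n
        sse_right = 0.0
        if right_n > 0:
            sse_right = (total_sq - left_sq) - (total_sum - left_sum) ** 2 / right_n
        loss = max(0.0, sse_left + sse_right) / m  # clamp tiny negative rounding
        if loss < best_loss:
            best_j, best_loss = j, loss
    return best_j, best_loss


split, loss = best_regression_split([1, 2, 3, 4, 5, 6], [1.0, 1.0, 1.0, 1.0, 5.0, 5.0])
```

In this toy input the labels are constant on each side of $j = 4$, so that split achieves zero loss. The baseline sorts and stores every observation, which is exactly the linear memory cost that a streaming algorithm is meant to avoid.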

Paper Structure (26 sections, 15 theorems, 68 equations, 3 figures, 6 algorithms)

Introduction
Compute the optimal split for regression.
Compute the optimal split for classification.
Compute the optimal split for classification with categorical observations.
Extending to multiple attributes.
Massively parallel computation model.
Related work and comparison to our results.
Preliminaries and notation.
Handling deletions.
Paper organization.
Compute the Optimal Split for Regression in Data Streams
Exact algorithm.
Additive error approximation for bounded range.
Multiplicative error approximation.
Compute the Optimal Split for Classification in Data Streams
...and 11 more sections

Key Result

Theorem 1

For regression, we have the following algorithms:

Figures (3)

Figure 1: The left figure is an example of regression. The optimal split is $j = 4$ which minimizes the mean squared error. The right figure is an example of classification. The optimal split is $j = 4$ which minimizes the misclassification rate.
Figure 2: Result summary. R: Regression, C: Classification, CC: classification with categorical attributes. 1: loss function based on misclassification rate, 2: loss function based on Gini impurity.
Figure 3: Result summary. R: Regression, C: Classification, CC: classification with categorical attributes. 1: loss function based on misclassification rate, 2: loss function based on Gini impurity.

Theorems & Definitions (32)

Theorem 1: Main Result 1
Theorem 2: Main Result 2
Theorem 3: Main Result 3
proof : Proof of Theorem 1 (1)
Lemma 4
proof
Lemma 5
proof
proof : Proof of Theorem 1 (2)
Lemma 6
...and 22 more
