Binary Split Categorical feature with Mean Absolute Error Criteria in CART

Peng Yu, Yike Chen, Chao Xu, Albert Bifet, Jesse Read

TL;DR

This work addresses MAE-based binary splits for categorical features in the CART family, showing that no data-independent unsupervised encoding can be optimal and that median-based heuristics can be significantly suboptimal. It introduces an exact, encoding-free algorithm that reduces MAE split computation to a Unimodal Cost 2-Median problem and achieves runtime $O(n \log n + k\log k\log n)$ by exploiting total-monotonicity and a specialized data structure for piecewise-linear functions. Theoretical results include a nonexistence proof for encoding-based solutions and a reduction to UC2M, while experiments on OpenML and Kaggle datasets demonstrate exact MAE-optimal splits with strong scaling and improved performance on imbalanced data. Overall, the method enhances categorical data handling in CART and offers techniques that could generalize to other unimodal-cost optimization problems in machine learning and computational geometry.

Abstract

In the context of the Classification and Regression Trees (CART) algorithm, efficient splitting of categorical features under standard criteria such as Gini and entropy is well established. For the Mean Absolute Error (MAE) criterion, however, handling categorical features has traditionally relied on various numerical encoding methods. This paper demonstrates that unsupervised numerical encoding methods are not viable for the MAE criterion. Furthermore, we present a novel and efficient splitting algorithm that addresses the challenges of handling categorical features with the MAE criterion. Our findings underscore the limitations of existing approaches and offer a promising solution to enhance the handling of categorical data in CART algorithms.
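
To make the objective concrete, one common way to write the MAE split problem is sketched below. The notation is an assumption borrowed from the Figure 1 caption: category $i \in \{1,\dots,k\}$ carries a multiset of target values $Y_i$, and $\operatorname{cost}$ is a name introduced here for illustration. The paper may normalize by the number of samples, which rescales the objective by a constant and does not change the minimizing split.

$$\operatorname{cost}(S) = \min_{c \in \mathbb{R}} \sum_{i \in S} \sum_{y \in Y_i} |y - c|, \qquad \min_{\emptyset \neq S \subsetneq \{1,\dots,k\}} \Big( \operatorname{cost}(S) + \operatorname{cost}\big(\{1,\dots,k\} \setminus S\big) \Big).$$

The inner minimum is attained at any median of the pooled values $\bigcup_{i \in S} Y_i$, so each side of the split is scored by its sum of absolute deviations from its own median.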

Paper Structure

This paper contains 18 sections, 9 theorems, 9 equations, 6 figures, 2 tables, and 2 algorithms.

Key Result

Theorem 1

No numerical encoding function works for binary splits under the MAE splitting criterion.
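
The statement can be illustrated, though not proved, on a small hand-built dataset. The sketch below compares one particular encoding, the per-category median, against an exhaustive search over all binary partitions. The data, the category names `A` to `D`, and all function names are hypothetical and chosen for this illustration; the exhaustive search is exponential in the number of categories and is not the paper's algorithm.

```python
from itertools import combinations
from statistics import median


def mae_cost(values):
    """Sum of absolute deviations from a median (0 for an empty group)."""
    if not values:
        return 0
    m = median(values)
    return sum(abs(v - m) for v in values)


def split_cost(groups, left):
    """MAE cost of sending the categories in `left` one way and the rest the other."""
    left_vals = [v for c in left for v in groups[c]]
    right_vals = [v for c in groups if c not in left for v in groups[c]]
    return mae_cost(left_vals) + mae_cost(right_vals)


def best_split_exhaustive(groups):
    """Try every binary partition (needs at least two categories)."""
    cats = sorted(groups)
    first, rest = cats[0], cats[1:]  # pin one category to skip mirrored splits
    best = None
    for r in range(len(rest)):  # r < len(rest) keeps the right side non-empty
        for extra in combinations(rest, r):
            left = (first, *extra)
            cost = split_cost(groups, left)
            if best is None or cost < best[0]:
                best = (cost, set(left))
    return best


def best_split_median_order(groups):
    """Heuristic: sort categories by median, then try only prefix splits."""
    order = sorted(groups, key=lambda c: median(groups[c]))
    best = None
    for i in range(1, len(order)):
        cost = split_cost(groups, order[:i])
        if best is None or cost < best[0]:
            best = (cost, set(order[:i]))
    return best


# Hypothetical toy data: A and B have skewed tails in opposite directions.
groups = {
    "A": [0, 0, 0, 100, 100],     # median 0, heavy tail to the right
    "B": [-100, -100, 1, 1, 1],   # median 1, heavy tail to the left
    "C": [-100] * 10,
    "D": [100] * 10,
}

print(best_split_exhaustive(groups))
print(best_split_median_order(groups))
```

On this toy data the exhaustive search groups `A` with `D` (and `B` with `C`) for a total cost of 603, while the best split that respects the median ordering `C, A, B, D` costs 997. This shows only that the median encoding fails here; the theorem concerns every encoding function.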

Figures (6)

  • Figure 1: An example where $\mathcal{Y}=\{Y_1,Y_2,Y_3,Y_4\}$, and the encoding function $e$ maps elements in $\mathcal{Y}$ to $\mathbb{R}$. There are $5$ downward closed sets, of which $3$ are splits, so $D(\mathcal{Y},e) = \{\{Y_1\},\{Y_1,Y_3\},\{Y_1,Y_3,Y_2\}\}$.
  • Figure 2: Find the optimum value of the MAE criteria.
  • Figure 3: Intuition of the $\dagger$ transform.
  • Figure 4: A demonstration of one step of the recursion algorithm.
  • Figure 5: Algorithm for finding unimodal 2-medians. For each procedure, the input $f$ is a sequence of piecewise-linear unimodal functions.
  • ...and 1 more figure
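
The candidate splits described in the Figure 1 caption can be sketched in a few lines. This reading assumes that a downward closed set is a prefix of the categories sorted by their encoded value, and that the encoded values are distinct; the numeric values in `e` are hypothetical and chosen only so that the order is $Y_1, Y_3, Y_2, Y_4$, matching the caption.

```python
def downward_closed_sets(encoding):
    """All prefixes of the categories sorted by encoded value (ties not handled)."""
    order = sorted(encoding, key=encoding.get)
    return [order[:i] for i in range(len(order) + 1)]


def candidate_splits(encoding):
    """Downward closed sets that are neither empty nor the whole set."""
    return [s for s in downward_closed_sets(encoding) if 0 < len(s) < len(encoding)]


# Hypothetical encoded values; only their relative order matters.
e = {"Y1": 0.5, "Y2": 2.0, "Y3": 1.2, "Y4": 3.1}

print(len(downward_closed_sets(e)))  # 5
print(candidate_splits(e))           # [['Y1'], ['Y1', 'Y3'], ['Y1', 'Y3', 'Y2']]
```

An encoding-based method with $k$ categories therefore examines only $k-1$ of the $2^{k-1}-1$ possible binary partitions, which is why the choice of encoding decides whether the optimal split is reachable at all.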

Theorems & Definitions (16)

  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • ...and 6 more