Tree-Regularized Tabular Embeddings

Xuan Li; Yun Wang; Bo Li

TL;DR

Quantitative experiments on 88 OpenML binary classification datasets show that the proposed tree-regularized representation not only narrows the gap with tree-based models, but also matches or outperforms advanced NN models.

Abstract

Tabular neural networks (NNs) have attracted considerable attention, and recent advances have gradually narrowed the performance gap with tree-based models on many public datasets. While mainstream work focuses on calibrating NNs to fit tabular data, we emphasize the importance of homogeneous embeddings and instead concentrate on regularizing tabular inputs through supervised pretraining. Specifically, we extend a recent work (DeepTLF) and use the structure of pretrained tree ensembles to transform raw variables into a single vector (T2V) or an array of tokens (T2T). Without loss of space efficiency, these binarized embeddings can be consumed by canonical tabular NNs with fully-connected or attention-based building blocks. Through quantitative experiments on 88 OpenML binary classification datasets, we validate that the proposed tree-regularized representation not only narrows the gap with tree-based models, but also achieves on-par or better performance compared with advanced NN models. Most importantly, it is more robust and can be easily scaled and generalized as a standalone encoder for the tabular modality. Code: https://github.com/milanlx/tree-regularized-embedding.
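To make the idea of a binarized, tree-derived vector concrete, here is a minimal sketch, not the authors' implementation. It assumes that the split nodes of a pretrained tree ensemble have already been flattened into two hypothetical lists, `split_features` and `split_thresholds`, and that each embedding bit records whether a sample's feature value exceeds the corresponding threshold. The function name `t2v_encode` and the toy values are illustrative only.

```python
import numpy as np


def t2v_encode(X, split_features, split_thresholds):
    """Binarize rows of X against (feature, threshold) pairs taken from tree splits.

    Assumption: bit j is 1 when X[i, split_features[j]] > split_thresholds[j].
    """
    X = np.asarray(X, dtype=float)
    feats = np.asarray(split_features, dtype=int)
    thresholds = np.asarray(split_thresholds, dtype=float)
    # X[:, feats] has shape (n_samples, n_splits); broadcasting compares every
    # sample with every threshold in one step instead of a pairwise loop.
    return (X[:, feats] > thresholds).astype(np.uint8)


# Hypothetical splits: two on feature 0, one on feature 1, one on feature 2.
split_features = [0, 0, 1, 2]
split_thresholds = [0.5, 1.5, -1.0, 10.0]

X = np.array([
    [0.2, 0.0, 12.0],
    [1.0, -2.0, 3.0],
    [2.0, 5.0, 10.0],
])

emb = t2v_encode(X, split_features, split_thresholds)
print(emb)
```

Each row of `emb` is a fixed-length binary vector with one entry per split, so it can be fed to a fully-connected network regardless of the original feature types. Storing the bits as `uint8` (or packing them) keeps the representation compact.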

Paper Structure (17 sections, 1 equation, 11 figures, 3 tables, 2 algorithms)

Introduction
Related Work
Heterogeneity in tabular embeddings
Tabular NN models and pretraining
Towards Data-Centric Tabular Learning
notations
Synthetic Experiments
Tree-regularized Embedding
supervised tree-regularized embeddings
Experiments
Datasets, models, and training details
Performance Evaluation
robustness
full-scale comparison
partial-scale comparison
...and 2 more sections

Figures (11)

Figure 1: An overview of data-centric tabular learning
Figure 2: Comparison between MLP and XGB with varying $\beta$ in terms of accuracy
Figure 3: A visualization of the synthetic data in 2D scenario
Figure 4: Overview of tree-to-vector (T2V) embedding
Figure 5: Pseudocode of in-batch transformation for T2V in a PyTorch-like style. Step 1 replaces pairwise comparison with matrix manipulation, while Step 2 showcases on-the-fly transformations for mini-batches implemented through the $\mathrm{transforms.Compose}$ module in PyTorch.
...and 6 more figures
