ExcelFormer: A neural network surpassing GBDTs on tabular data

Jintai Chen; Jiahuan Yan; Qiyuan Chen; Danny Ziyi Chen; Jian Wu; Jimeng Sun

TL;DR

ExcelFormer targets the persistent challenges of tabular data prediction by combining a semi-permeable attention mechanism with an interaction-attenuated initialization and a GLU-based attentive FFN, complemented by two tabular-specific data augmentations, Feat-Mix and Hid-Mix. This design disrupts rotational invariance and improves data efficiency while maintaining a model size comparable to existing tabular Transformers. Across 96 small and 21 large real-world datasets, ExcelFormer consistently outperforms GBDTs and prior DNNs using default parameters and with limited tuning, demonstrating both strong accuracy and practical usability. The work shows that careful architectural choices and augmentation strategies can yield a robust, user-friendly “sure bet” solution for diverse tabular prediction tasks, reducing the need for extensive hyperparameter search and model selection by casual users.
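The two augmentations are only named here, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes Feat-Mix exchanges whole input features between two samples and Hid-Mix exchanges embedding dimensions between their token embeddings, with the label mixed in proportion to what each sample contributes. The function names, the `lam` mixing rate, and the optional `importance` weights are hypothetical.

```python
import numpy as np

def feat_mix(x1, x2, y1, y2, lam, rng, importance=None):
    """Assumed reading of Feat-Mix: swap whole features between two samples."""
    n_features = x1.shape[0]
    # Each feature is taken from x1 with probability lam, otherwise from x2.
    mask = rng.random(n_features) < lam
    x_new = np.where(mask, x1, x2)
    # Label weight: share of (optionally importance-weighted) features from x1.
    w = np.ones(n_features) if importance is None else np.asarray(importance, dtype=float)
    weight = float(w[mask].sum() / w.sum())
    return x_new, weight * y1 + (1.0 - weight) * y2

def hid_mix(z1, z2, y1, y2, lam, rng):
    """Assumed reading of Hid-Mix: swap embedding dimensions between two samples."""
    d = z1.shape[1]  # z has shape (n_features, d): one token per feature
    # One mask over embedding dimensions, shared by all feature tokens.
    mask = rng.random(d) < lam
    z_new = np.where(mask[None, :], z1, z2)
    weight = float(mask.mean())
    return z_new, weight * y1 + (1.0 - weight) * y2

# Hypothetical usage on toy data.
rng = np.random.default_rng(0)
x_a = np.array([1.0, 2.0, 3.0, 4.0])
x_b = np.array([10.0, 20.0, 30.0, 40.0])
x_mixed, y_mixed = feat_mix(x_a, x_b, 1.0, 0.0, lam=0.5, rng=rng)
```

Unlike a convex combination, every entry of the mixed sample is an observed value from one of the two originals, which is the property this sketch is meant to illustrate.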

Abstract

Data organized in tabular format is ubiquitous in real-world applications, and users often craft tables with biased feature definitions and flexibly set prediction targets of their interest. A robust, effective, dataset-versatile, and user-friendly tabular prediction approach is therefore highly desirable. While Gradient Boosting Decision Trees (GBDTs) and existing deep neural networks (DNNs) have been extensively utilized by professional users, they present several challenges for casual users, particularly: (i) the dilemma of model selection due to their different dataset preferences, and (ii) the need for heavy hyperparameter search, without which their performance is deemed inadequate. In this paper, we address the following question: Can we develop a deep learning model that serves as a "sure bet" solution for a wide range of tabular prediction tasks, while also being user-friendly for casual users? We examine three key drawbacks of deep tabular models: (P1) lack of the rotational variance property, (P2) large data demand, and (P3) over-smooth solutions. We propose ExcelFormer, which addresses these challenges through a semi-permeable attention module that constrains the influence of less informative features to break the DNNs' rotational invariance property (for P1), data augmentation approaches tailored for tabular data (for P2), and an attentive feedforward network to boost the model's fitting capability (for P3). Together, these designs make ExcelFormer a "sure bet" solution for diverse tabular datasets. Extensive and stratified experiments on real-world datasets demonstrate that our model outperforms previous approaches across diverse tabular data prediction tasks, and that the framework is friendly to casual users, offering ease of use without heavy hyperparameter tuning.
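One way to picture a semi-permeable attention module that constrains the influence of less informative features is as an additive attention mask. The sketch below is an illustration under assumptions, not the paper's definition: it assumes each feature has a scalar importance score and that a query feature may attend only to key features at least as informative as itself. The names `semi_permeable_mask` and `masked_attention` are hypothetical.

```python
import numpy as np

def semi_permeable_mask(importance):
    """Additive mask: 0 where attention is allowed, -inf where it is blocked."""
    imp = np.asarray(importance, dtype=float)
    # allowed[i, j] is True when key feature j is at least as informative
    # as query feature i (assumed rule), so the diagonal is always allowed.
    allowed = imp[None, :] >= imp[:, None]
    return np.where(allowed, 0.0, -np.inf)

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention over feature tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    # Subtract the row maximum for numerical stability; it is finite
    # because every row keeps at least its diagonal entry.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # exp(-inf) gives exactly 0 for blocked pairs
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical usage: three feature tokens with increasing importance.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 4))
mask = semi_permeable_mask([0.1, 0.5, 0.9])
output, attn = masked_attention(tokens, tokens, tokens, mask)
```

Under this assumed rule the most informative token attends only to itself, while less informative tokens can still read from more informative ones, so information flows in one direction only.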

Paper Structure (31 sections, 9 equations, 5 figures, 15 tables)

Introduction
Related Work
Supervised Tabular Data Prediction
Mixup and other Data Augmentations
ExcelFormer
Solving (P1) with Semi-Permeable Attention
Interaction Attenuated Initialization
Solving (P3) with GLU layer
The rest part of ExcelFormer
Feature Pre-processing
Prediction Head
Solving (P2) with Data Augmentation
Training and Loss Functions
Experiments
Experimental Setups
...and 16 more sections

Figures (5)

Figure 1: Performance variation with different percentages of training samples on three large-scale tabular datasets (total training sample count in parentheses). DNNs often exhibit better performance with larger training sample sizes, whereas GBDTs perform better when data is scarce.
Figure 2: An illustration of our proposed ExcelFormer model. Diverse feature types are pre-processed following the procedure of Rubachev et al. (2022), followed by a quantile transformation and an embedding layer that convert them into numerical embeddings. Each feature embedding $z_i\in \mathbb{R}^d$ serves as a token in ExcelFormer.
Figure 3: $k$NN ($k=8$) decision boundaries with 2 key features of a zoomed-in part of the Higgs dataset. Convex combinations by vanilla Mixup (points on the black line) of 2 samples $x_1$ and $x_2$ may conflict with irregular category boundaries.
Figure 4: Examples of Hid-Mix and Feat-Mix, where "emb." means the "embedding" dimension.
Figure 5: Model performances under random dataset rotations. Test accuracy scores have been normalized across datasets, and the boxes represent the distribution of scores across 20 random seeds. XGb.: XGBoost, Cat.: CatBoost, EXC.: ExcelFormer without data augmentation.
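The random dataset rotation behind Figure 5 can be sketched as follows. This is a common way to run such a test and is an assumption here, since the exact protocol is not given in this summary: draw a random orthogonal matrix from the QR decomposition of a Gaussian matrix and apply it to the train and test features alike. The function names are hypothetical.

```python
import numpy as np

def random_orthogonal(n_features, rng):
    """Draw a random orthogonal matrix (it may include a reflection)."""
    a = rng.standard_normal((n_features, n_features))
    q, r = np.linalg.qr(a)
    # Flip column signs using the diagonal of r so that the draw is
    # uniform over orthogonal matrices rather than biased by the QR convention.
    return q * np.sign(np.diag(r))

def rotate_dataset(X_train, X_test, rng):
    """Apply the same random orthogonal map to train and test features."""
    R = random_orthogonal(X_train.shape[1], rng)
    return X_train @ R, X_test @ R

# Hypothetical usage on synthetic numerical features.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 5))
X_test = rng.standard_normal((20, 5))
X_train_rot, X_test_rot = rotate_dataset(X_train, X_test, rng)
```

The map preserves distances and inner products between samples but blends every original feature into every new column. A rotationally invariant learner should therefore score about the same before and after, whereas a model that exploits the original feature axes is expected to degrade.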
