Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data

David Holzmüller; Léo Grinsztajn; Ingo Steinwart

TL;DR

The paper tackles the practical gap between gradient-boosted decision trees (GBDTs) and neural networks on tabular data by developing RealMLP, a strengthened MLP, and strong tuned defaults for both RealMLP and GBDTs. Tuning on a large meta-train benchmark and evaluating on a disjoint meta-test benchmark, the authors show that RealMLP, with carefully designed preprocessing, architecture, and training, offers a favorable time-accuracy tradeoff and is competitive with GBDTs, and that combining RealMLP and GBDTs with tuned defaults can yield excellent results without hyperparameter optimization (HPO). They further show that some RealMLP enhancements transfer to TabR, improving its default performance, and that algorithm portfolios often outperform single-model HPO strategies. Tuned defaults generally transfer well and offer practical speedups; CatBoost defaults remain strong but slower, and the study highlights how much benchmarking choices matter. Overall, the work advocates using robust defaults across model families and exploiting ensembling or algorithm selection to achieve strong results on tabular data with limited tuning.

Abstract

For classification and regression on tabular data, the dominance of gradient-boosted decision trees (GBDTs) has recently been challenged by often much slower deep learning methods with extensive hyperparameter tuning. We address this discrepancy by introducing (a) RealMLP, an improved multilayer perceptron (MLP), and (b) strong meta-tuned default parameters for GBDTs and RealMLP. We tune RealMLP and the default parameters on a meta-train benchmark with 118 datasets and compare them to hyperparameter-optimized versions on a disjoint meta-test benchmark with 90 datasets, as well as the GBDT-friendly benchmark by Grinsztajn et al. (2022). Our benchmark results on medium-to-large tabular datasets (1K to 500K samples) show that RealMLP offers a favorable time-accuracy tradeoff compared to other neural baselines and is competitive with GBDTs in terms of benchmark scores. Moreover, a combination of RealMLP and GBDTs with improved default parameters can achieve excellent results without hyperparameter tuning. Finally, we demonstrate that some of RealMLP's improvements can also considerably improve the performance of TabR with default parameters.
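
The idea of combining several pre-tuned defaults without hyperparameter tuning can be illustrated with a minimal sketch: train each default configuration once and keep the one with the lowest validation error. This is an assumption about how such a combination might be implemented, not the authors' code. The function name `select_best_default` is hypothetical, and the validation errors below are made-up placeholders, not results from the paper.

```python
from typing import Dict, Tuple


def select_best_default(val_errors: Dict[str, float]) -> Tuple[str, float]:
    """Return the (name, error) pair with the lowest validation error.

    `val_errors` maps a method name to its validation error on one dataset.
    Ties are resolved in favor of the method that was inserted first.
    """
    if not val_errors:
        raise ValueError("val_errors must contain at least one method")
    best_name = min(val_errors, key=val_errors.get)
    return best_name, val_errors[best_name]


# Placeholder validation errors for illustration only (NOT from the paper).
val_errors = {
    "RealMLP-TD": 0.112,
    "CatBoost-TD": 0.108,
    "LGBM-TD": 0.115,
    "XGB-TD": 0.117,
}

best_name, best_err = select_best_default(val_errors)
print(f"selected {best_name} with validation error {best_err}")
```

Because every candidate is trained once with fixed parameters, the cost of this selection is the sum of the individual training times, which is typically far below that of a per-model hyperparameter search.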

Paper Structure (71 sections, 3 equations, 29 figures, 44 tables)

Introduction
Contribution
Related Work
Neural networks
Benchmarks
Better defaults
Meta-learning
Methodology
Benchmark Data Selection
Aggregate Benchmark Score
Improving Neural Networks
Data preprocessing
NN architecture
Initialization
Training
...and 56 more sections

Figures (29)

Figure 1: Components of RealMLP-TD. Part (c) shows the result of adding one component in each step, where the best default learning rate is found separately for each step. The vanilla MLP uses categorical embeddings, a quantile transform to preprocess numerical features, default PyTorch initialization, ReLU activation, early stopping, and is optimized with Adam with default parameters. For more details, see \ref{sec:appendix:vanilla}. The error bars are approximate 95% confidence intervals for the limit $\#\text{splits} \to \infty$, see \ref{sec:appendix:confidence_intervals}.
Figure 2: Benchmark scores on all benchmarks vs. average training time. The $y$-axis shows the shifted geometric mean ($\operatorname{SGM}_\varepsilon$) classification error (left) or nRMSE (right) as explained in \ref{sec:aggregate_metrics}. The $x$-axis shows average training times per 1000 samples (measured on $\mathcal{B}^{\operatorname{train}}$ for efficiency reasons), see \ref{sec:appendix:runtimes}. The error bars are approximate 95% confidence intervals for the limit $\#\text{splits} \to \infty$, see \ref{sec:appendix:confidence_intervals}. Note that XGB results on some (mainly meta-test) datasets are affected by a bug in handling rare categories, see \ref{sec:appendix:experiments}.
Figure 3: Benchmark scores vs. average training time for AUC. Methods labeled "no LS" deactivate label smoothing. Stopping and best-epoch selection are performed on accuracy, while HPO is performed on AUC. See \ref{fig:pareto_auc-ovr_val-ce} for stopping on cross-entropy. The $y$-axis shows the shifted geometric mean ($\operatorname{SGM}_\varepsilon$) $1-\mathrm{AUC}$ as explained in \ref{sec:aggregate_metrics}. The $x$-axis shows average training times per 1000 samples (measured on $\mathcal{B}^{\operatorname{train}}$ for efficiency reasons), see \ref{sec:appendix:runtimes}. The error bars are approximate 95% confidence intervals for the limit $\#\text{splits} \to \infty$, see \ref{sec:appendix:confidence_intervals}.
Figure B.1: Effect of stopping patiences and metrics on the performance of GBDTs on $\mathcal{B}^{\operatorname{train}}_{\mathrm{class}}$. We run XGB-TD, LGBM-TD, and CatBoost-TD with different early stopping patiences (early_stopping_rounds). We compare three different metrics used for stopping and best-epoch selection: classification error, Brier loss, and cross-entropy loss. The $y$-axis reports the increase in the benchmark score relative to stopping on classification error with patience $1000$ (i.e., never stopping early). The shaded areas are approximate 95% confidence intervals, cf. \ref{sec:appendix:confidence_intervals}.
Figure B.2: Effect of stopping patiences on the performance of GBDTs on $\mathcal{B}^{\operatorname{train}}_{\mathrm{reg}}$. We run the TD configurations of XGB, LGBM, and CatBoost with different early stopping patiences (early_stopping_rounds). As in the remainder of the paper, we use RMSE for early stopping and best-epoch selection. The $y$-axis reports the increase in the benchmark score relative to stopping with patience $1000$ (i.e., never stopping early). The shaded areas are approximate 95% confidence intervals, cf. \ref{sec:appendix:confidence_intervals}.
...and 24 more figures
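
The captions above report benchmark scores as a shifted geometric mean ($\operatorname{SGM}_\varepsilon$) of per-dataset errors. The sketch below shows one common form of this aggregate, $\exp\big(\tfrac{1}{N}\sum_i \log(x_i + \varepsilon)\big)$. The exact definition used in the paper is given in its section on aggregate metrics and is not reproduced on this page, so the form shown here (in particular, not subtracting $\varepsilon$ afterwards) and the default `eps=0.01` are assumptions.

```python
import math
from typing import Sequence


def shifted_geometric_mean(errors: Sequence[float], eps: float = 0.01) -> float:
    """Aggregate per-dataset errors as exp(mean(log(error + eps))).

    The shift `eps` keeps the logarithm finite when an error is exactly zero
    and limits the influence of datasets with very small errors.
    Both the formula variant and the default `eps` are assumptions.
    """
    if len(errors) == 0:
        raise ValueError("errors must be non-empty")
    if eps <= 0:
        raise ValueError("eps must be positive")
    if any(e < 0 for e in errors):
        raise ValueError("errors must be non-negative")
    log_sum = sum(math.log(e + eps) for e in errors)
    return math.exp(log_sum / len(errors))


# Illustrative per-dataset errors (placeholders, not results from the paper).
example_errors = [0.0, 0.05, 0.20, 0.50]
sgm = shifted_geometric_mean(example_errors)
arithmetic = sum(example_errors) / len(example_errors)
print(f"SGM: {sgm:.4f}, arithmetic mean: {arithmetic:.4f}")
```

Compared with an arithmetic mean, this aggregate rewards relative improvements on datasets with small errors, while the shift prevents a single near-zero error from dominating the score.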
