
Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

Mayuka Jayawardhana, Renbo, Samuel Dooley, Valeriia Cherepanova, Andrew Gordon Wilson, Frank Hutter, Colin White, Tom Goldstein, Micah Goldblum

TL;DR

This work addresses the gap between transformers that can exploit column headers and pretraining (LLMs and TabPFN) and scalable gradient-boosted decision trees (GBDTs) on tabular data. It introduces LLM-Boost and PFN-Boost, lightweight fusion schemes that seed a GBDT with transformer-derived predictions and let the trees learn the residuals, with a scaling parameter that controls how strongly the transformer predictions influence the ensemble. The fused models match or surpass the transformer at sufficiently small dataset sizes and the GBDT at sufficiently large ones, and outperform both across a wide range of sizes in between. PFN-Boost achieves the best average performance among all tested methods for all but very small dataset sizes, where LLM-Boost is stronger. The approach is data-efficient, CPU-friendly for training, and open-sourced.

Abstract

Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and lightweight approach for fusing large language models and TabPFN with gradient-boosted decision trees, which allows scalable GBDTs to benefit from the natural language capabilities and pretraining of transformers. We name our fusion methods LLM-Boost and PFN-Boost, respectively. While matching or surpassing the performance of the transformer at sufficiently small dataset sizes and GBDTs at sufficiently large sizes, LLM-Boost and PFN-Boost outperform both standalone components on a wide range of dataset sizes in between. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms. We find that PFN-Boost achieves the best average performance among all methods we test for all but very small dataset sizes. We release our code at http://github.com/MayukaJ/LLM-Boost.
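The fusion idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a binary task and a single numeric feature, and it uses hand-rolled decision stumps in place of XGBoost. The names `boost_from_prior`, `fit_stump`, `predict`, `scale`, `lr`, and `rounds` are hypothetical and chosen for illustration only. The one thing it is meant to show is that boosting starts from the scaled transformer logits instead of a constant, so the trees only have to learn what the transformer got wrong. In XGBoost, per-row initial scores can be supplied through `base_margin`; whether the released code does it that way is not stated here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p, eps=1e-6):
    # Clip so that probabilities of exactly 0 or 1 stay finite.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], overall, overall)  # fallback: no valid split
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]  # (threshold, left value, right value)

def boost_from_prior(xs, ys, prior_probs, scale=1.0, rounds=20, lr=0.5):
    """Gradient boosting on log loss, seeded with scaled transformer logits.

    prior_probs are the transformer's predicted P(y=1) per training row
    (an assumption of this sketch). scale=0 ignores the transformer.
    """
    margins = [scale * logit(p) for p in prior_probs]
    stumps = []
    for _ in range(rounds):
        # Negative gradient of log loss w.r.t. the margin is y - sigmoid(margin).
        residuals = [y - sigmoid(m) for y, m in zip(ys, margins)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        margins = [m + lr * (lv if x <= t else rv) for x, m in zip(xs, margins)]
    return stumps

def predict(x, prior_prob, stumps, scale=1.0, lr=0.5):
    """Final probability: scaled transformer logit plus the stump corrections."""
    margin = scale * logit(prior_prob)
    for t, lv, rv in stumps:
        margin += lr * (lv if x <= t else rv)
    return sigmoid(margin)

if __name__ == "__main__":
    # Made-up data: the prior is confidently wrong on the last row.
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0, 0, 1, 1]
    prior = [0.2, 0.3, 0.7, 0.3]
    stumps = boost_from_prior(xs, ys, prior, scale=1.0)
    for x, p in zip(xs, prior):
        print(x, round(p, 2), round(predict(x, p, stumps), 3))
```

With no boosting rounds and `scale=1.0`, the sketch returns the transformer's probability unchanged; with many rounds and enough data, the tree corrections dominate. That mirrors the small-data and large-data regimes described in the abstract.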

Paper Structure

This paper contains 30 sections, 2 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: How LLM-Boost works for a toy cat vs. dog classification problem. Selected nodes are shown in light blue. The scaling parameter S controls the effect of the LLM predictions on the tree ensemble.
  • Figure 2: A few-shot prompt for the UCI adult income dataset designed to extract the LLM prediction scores required for LLM-Boost.
  • Figure 3: PFN-Boost, combining TabPFN and XGBoost, outperforms ensemble baselines and standalone models across dataset sizes. Left: Average z-score based on AUC performance across dataset sizes for PFN-Boost and other ensemble baselines. Right: Average AUC across dataset sizes.
  • Figure 4: LLM-Boost, combining Qwen-2.5-72B-Instruct and XGBoost, outperforms ensemble baselines and the standalone constituent models across small dataset sizes. Left: Average z-score based on AUC performance across dataset sizes for LLM-Boost and other ensemble baselines. Right: AUC performance across dataset sizes. Note: In this experiment, the LLM scores are always computed with a 3-shot prompt, so the LLM's performance does not depend on the training set size; the additional data is used only for GBDT training. The trough in LLM performance in the 100-500 training-set-size range arises because those data points use only the subset of datasets that have enough training samples.
  • Figure 5: PFN-Boost outperforms LLM-Boost on larger datasets. Direct comparison of LLM-Boost (XGBoost + Qwen-2.5-72B-Instruct) with PFN-Boost (XGBoost + TabPFN). Boosted TabPFN performs better except at small dataset sizes. This is expected, since TabPFN on its own is far superior to the standalone LLM on average. However, improved LLMs may in turn improve LLM-Boost.
  • ...and 10 more figures
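
Figures 3 and 4 report an average z-score based on AUC. The captions do not spell out the computation, so the following is only one plausible reading: within each dataset, standardize every method's AUC against the mean and standard deviation over the methods on that dataset, then average each method's z-scores over datasets. The function name `average_z_scores`, the dataset names, and all AUC values below are made up for illustration.

```python
from statistics import mean, pstdev

def average_z_scores(auc_by_dataset):
    """Map {dataset: {method: auc}} to {method: mean per-dataset z-score}."""
    per_method = {}
    for scores in auc_by_dataset.values():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)  # population std over the methods on this dataset
        for method, auc in scores.items():
            # If every method ties on a dataset, assign z = 0 to all of them.
            z = 0.0 if sigma == 0 else (auc - mu) / sigma
            per_method.setdefault(method, []).append(z)
    return {method: mean(zs) for method, zs in per_method.items()}

# Hypothetical AUCs, not results from the paper.
toy_aucs = {
    "toy_a": {"gbdt": 0.80, "transformer": 0.90, "boosted": 0.91},
    "toy_b": {"gbdt": 0.85, "transformer": 0.70, "boosted": 0.86},
}

if __name__ == "__main__":
    for method, z in sorted(average_z_scores(toy_aucs).items()):
        print(method, round(z, 3))
```

Standardizing per dataset keeps an easy dataset, where every method scores high, from dominating the average, which is a common reason to report z-scores next to raw AUC.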