UniPredict: Large Language Models are Universal Tabular Classifiers

Ruiyu Wang; Zifeng Wang; Jimeng Sun

TL;DR

UniPredict tackles the rigid target limitation of conventional tabular predictors by proposing a universal tabular modeling paradigm based on large language models. It trains a single GPT-2-based predictor on 169 diverse datasets, with prompt engineering and target augmentation to enable predictions for arbitrary targets, and validates performance against dataset-specific baselines. The framework achieves up to 13.4% relative gains over the best neural baselines and 5.4% over the top tree-boosting methods, while showing strong few-shot adaptability on 62 unseen datasets and robustness in low-resource settings. The work demonstrates the feasibility and value of universal, instruction-tuned tabular prediction at scale, and offers practical insights into metadata quality, context window limits, and feature value cleanliness for deployment.

Abstract

Tabular data prediction is a fundamental machine learning task for many applications. Existing methods predominantly employ discriminative modeling and assume a fixed target column, which necessitates re-training for every new predictive task. Inspired by the generative power of large language models (LLMs), this paper explores building universal tabular data predictors based on generative modeling, an approach we call UniPredict. We demonstrate that an LLM scales to extensive tabular datasets, comprehending diverse tabular inputs and predicting target variables according to the provided instructions. Specifically, we train a single LLM on an aggregation of 169 tabular datasets with diverse targets and compare its performance against baselines that are trained on each dataset separately. This versatile UniPredict model outperforms the best tree-boosting baseline by 5.4% and the best neural network baseline by 13.4%. We further test UniPredict in few-shot learning settings on another 62 tabular datasets, where it adapts quickly to new tasks. In the low-resource few-shot setup, we observe a performance advantage of more than 100% over XGBoost and a significant margin over all other baselines. We envision that UniPredict sheds light on developing a universal tabular data prediction system that learns from data at scale and serves a wide range of prediction tasks.

Paper Structure (39 sections, 8 figures, 4 tables)

Introduction
Method and Implementation
Problem Formulation
Universal Tabular Modeling
Few-shot Learning
Prompt Engineering
Metadata Re-formatting
Feature Serialization
Instruction Formulation & Target Augmentation
Target Augmentation
Instruction Formulation
Learning
LLM for Tabular Prediction
Learning
Our Implementation of UniPredict
...and 24 more sections

Figures (8)

Figure 1: Three tabular modeling paradigms. Left: in Traditional Tabular Modeling (Figure 1a), a distinct model is trained on each dataset, so models cannot adapt to new datasets with different features and targets. Middle: in In-Domain Tabular Modeling (Figure 1b), features may vary across datasets, but the targets remain the same. Right: the proposed Universal Tabular Modeling paradigm (Figure 1c) accommodates arbitrary inputs and predicts arbitrary targets. It imposes no restrictions on the domains of the datasets used, so the datasets can originate from entirely different domains.
Figure 2: The UniPredict framework. It consists of three steps: 1) Prompt Setup sets up the prompts by metadata, sample serialization, and instructions; 2) Target Augmentation transforms target values into categories with confidence estimates; and 3) Learning fine-tunes the backbone model by prompts and targets yielded from the previous procedures.
Figure 3: The average accuracy and rank of UniPredict-heavy, UniPredict-light, TabLLM (Hegselmann et al., 2023), XGBoost (Chen and Guestrin, 2016), MLP, TabNet (Arik and Pfister, 2021), and FT-Transformer (Gorishniy et al., 2021) on 169 datasets. Each dot indicates a trial on one dataset. UniPredict-heavy demonstrates a remarkable performance advantage over the best neural network model (FT-Transformer), with a relative improvement of 13.4%. It also surpasses the best-performing tree-boosting algorithm by a margin of 5.4%. Our framework's advantage is further confirmed by the model ranking (lower is better).
Figure 4: The average accuracy and rank of UniPredict-heavy, UniPredict-light, TabLLM, XGBoost, MLP, TabNet, and FT-Transformer on 62 datasets. We vary the training data size from the lowest (10%) to the highest (90%) of the full dataset. The pre-trained UniPredict series exhibits remarkable data efficiency in generalizing to new tasks.
Figure 5: An overview of the causes of poor performance for either model (Figure 5a), UniPredict-heavy (Figure 5b), or UniPredict-light (Figure 5c). As described in the case study section, COL, FV, META, and OTH stand for Excessive Column Number, Bad Feature Values, Bad Metadata, and Other reasons, respectively. Among the 169 datasets examined, 8 datasets are included in UniPredict-heavy's investigation, with 12 causes identified. UniPredict-light fails on 10 datasets, with 11 causes identified.
...and 3 more figures
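
The first two steps named in the Figure 2 caption (prompt setup and target augmentation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the "name is value" serialization, the prompt template, and the "class i: p" output format are all hypothetical choices made for illustration, and the per-class confidences are simply passed in rather than estimated.

```python
from typing import Dict, List


def serialize_row(row: Dict[str, object], target: str) -> str:
    """Turn every non-target column into a 'name is value' clause (assumed format)."""
    return "; ".join(f"{k} is {v}" for k, v in row.items() if k != target)


def build_prompt(metadata: str, row: Dict[str, object], target: str,
                 classes: List[str]) -> str:
    """Combine metadata, serialized features, and an instruction (hypothetical template)."""
    options = ", ".join(f"class {i}: {c}" for i, c in enumerate(classes))
    return (
        f"Dataset description: {metadata}\n"
        f"Features: {serialize_row(row, target)}\n"
        f"Instruction: predict '{target}' and give a probability for each of: {options}."
    )


def augment_target(confidences: List[float]) -> str:
    """Render normalized per-class confidences as the text a model would be trained to emit."""
    total = sum(confidences)
    return "; ".join(f"class {i}: {p / total:.2f}" for i, p in enumerate(confidences))


# Illustrative row; the same row can be re-prompted for a different target column.
row = {"age": 39, "education": "Bachelors", "income": ">50K"}
prompt = build_prompt("Census records", row, "income", ["<=50K", ">50K"])
label = augment_target([1, 3])
```

Because the target column is just an argument, the same row can be serialized for any target, which is the property the Universal Tabular Modeling paradigm in Figure 1 relies on. The learning step would then fine-tune a language model on such (prompt, label) pairs.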
