Making Pre-trained Language Models Great on Tabular Prediction

Jiahuan Yan, Bo Zheng, Hongxia Xu, Yiheng Zhu, Danny Z. Chen, Jimeng Sun, Jian Wu, Jintai Chen

TL;DR

This work tackles the challenge of transferring deep models to tabular prediction by tailoring language models to tabular data. It introduces TP-BERTa, a RoBERTa-based encoder augmented with Relative Magnitude Tokenization (RMT) for discretized numeric values and an Intra-Feature Attention (IFA) module to fuse feature names with values, enabling feature-aware representations. Pre-trained on 101 binary classification and 101 regression tabular datasets, TP-BERTa achieves state-of-the-art performance among tabular DNNs and remains competitive with Gradient Boosted Decision Trees across 145 downstream tasks; ablations show that RMT and IFA are critical for success. The results demonstrate substantial cross-table transferability and highlight the practical potential of LM-based tabular learners, especially when features contain meaningful categorical semantics.

Abstract

The transferability of deep neural networks (DNNs) has made significant progress in image and language processing. However, because tables are heterogeneous, this benefit of DNNs is still far from being well exploited in tabular data prediction (e.g., regression or classification tasks). Having condensed knowledge from diverse domains, language models (LMs) can comprehend feature names from various tables and could serve as versatile learners that transfer knowledge across distinct tables and diverse prediction tasks, but their discrete text representation space is inherently incompatible with numerical feature values in tables. In this paper, we present TP-BERTa, an LM pre-trained specifically for tabular data prediction. Concretely, a novel relative magnitude tokenization converts scalar numerical feature values into finely discretized, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names. Comprehensive experiments demonstrate that our pre-trained TP-BERTa leads in performance among tabular DNNs and is competitive with Gradient Boosted Decision Tree models in the typical tabular data regime.
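
To make the idea of relative magnitude tokenization more concrete, the following is a minimal sketch of how a scalar feature value could be mapped to a discrete token index relative to that feature's own value distribution. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_bin_edges` and `magnitude_token` are hypothetical, and simple empirical quantile cut points stand in for the feature-specific C4.5 decision tree that the paper uses for discretization.

```python
from bisect import bisect_right
from typing import List, Sequence


def fit_bin_edges(values: Sequence[float], n_bins: int) -> List[float]:
    """Return interior cut points for one feature (hypothetical helper).

    ASSUMPTION: empirical quantiles are used here for simplicity; the paper
    discretizes with a feature-specific C4.5 decision tree instead.
    """
    ordered = sorted(values)
    edges: List[float] = []
    for k in range(1, n_bins):
        idx = (k * len(ordered)) // n_bins
        cut = ordered[min(idx, len(ordered) - 1)]
        # Keep cut points strictly increasing so every bin is non-empty.
        if not edges or cut > edges[-1]:
            edges.append(cut)
    return edges


def magnitude_token(value: float, edges: Sequence[float]) -> int:
    """Map a scalar to a magnitude token index in [0, len(edges)]."""
    # bisect_right counts how many cut points are <= value.
    return bisect_right(edges, value)


if __name__ == "__main__":
    # Toy "blood pressure" column; the values are made up for illustration.
    blood_pressure = [98, 110, 121, 135, 142, 150, 163, 177]
    bp_edges = fit_bin_edges(blood_pressure, n_bins=4)
    for reading in (105, 142, 190):
        print(reading, "-> magnitude token", magnitude_token(reading, bp_edges))
```

Because each feature gets its own cut points, the same token index means "relatively small" or "relatively large" for any feature, which is what would let a single set of magnitude token embeddings be shared across numerical features.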

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, and 14 tables.

Figures (7)

  • Figure 1: Illustration of the TP-BERTa workflow. "BP" in the input table denotes the feature name text "blood pressure". The rectangles with "B", "P", and "Gender" ("G") represent the word embeddings of "blood", "pressure", and "gender", respectively. In the RMT process, numerical values are discretized by a feature-specific C4.5 decision tree. In the IFA process, "MT#$i$" indicates the $i$-th magnitude token. All numerical features share these MT embeddings for magnitude representation. "MHSA" is a multi-head self-attention module shared across all features for feature refinement.
  • Figure 2: Rank variation curve plots of several representative models with respect to variations of some feature type characteristics. Each point represents a set of datasets in a range of $\alpha$ or $\beta$.
  • Figure 3: Comparison of finetuning the non-pre-trained TP-BERTa with and without regularization. The validation AUC curves of several representative binary classification datasets show that the magnitude-aware triplet loss (see Eq. (\ref{regloss})) helps TP-BERTa converge quickly and avoid potential overfitting. In experiments, we use this regularization only in pre-training to smooth and accelerate the learning process.
  • Figure 4: The t-SNE visualization of 256 magnitude token embeddings before and after pre-training.
  • Figure 5: Rank change curve plots of several representative models with variations of data volume ($N$). We divide the datasets into two groups (the first column is for "$\beta > 0.1$" and the second column is for "$\beta \le 0.1$") to alleviate the impact from the feature type distributions. The split value 0.1 is chosen by keeping a roughly equal number of datasets in both groups.
  • ...and 2 more figures