Making Pre-trained Language Models Great on Tabular Prediction
Jiahuan Yan, Bo Zheng, Hongxia Xu, Yiheng Zhu, Danny Z. Chen, Jimeng Sun, Jian Wu, Jintai Chen
TL;DR
This work tackles the challenge of transferring deep models to tabular prediction by tailoring language models to tabular data. It introduces TP-BERTa, a RoBERTa-based encoder augmented with Relative Magnitude Tokenization (RMT) for discretized numeric values and an Intra-Feature Attention (IFA) module to fuse feature names with values, enabling feature-aware representations. Pre-trained on 101 binary classification and 101 regression tabular datasets, TP-BERTa achieves state-of-the-art performance among tabular DNNs and remains competitive with Gradient Boosted Decision Trees across 145 downstream tasks; ablations show that RMT and IFA are critical for success. The results demonstrate substantial cross-table transferability and highlight the practical potential of LM-based tabular learners, especially when features contain meaningful categorical semantics.
Abstract
The transferability of deep neural networks (DNNs) has made significant progress in image and language processing. However, due to the heterogeneity among tables, this benefit of DNNs is still far from being well exploited in tabular data prediction (e.g., regression or classification tasks). Having condensed knowledge from diverse domains, language models (LMs) can comprehend feature names from various tables and could serve as versatile learners that transfer knowledge across distinct tables and diverse prediction tasks. However, their discrete text representation space is inherently incompatible with numerical feature values in tables. In this paper, we present TP-BERTa, an LM pre-trained specifically for tabular data prediction. Concretely, a novel relative magnitude tokenization converts scalar numerical feature values into finely discretized, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names. Comprehensive experiments demonstrate that our pre-trained TP-BERTa leads in performance among tabular DNNs and is competitive with Gradient Boosted Decision Tree models in the typical tabular data regime.
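To make the idea of turning a scalar feature value into a discrete token more concrete, the following is a minimal sketch and not the TP-BERTa implementation. It assumes quantile-based binning as a stand-in for whatever discretization the method actually uses, and the names `fit_bin_edges`, `magnitude_tokens`, and `embedding_table`, as well as the bin count and embedding size, are hypothetical choices made only for illustration.

```python
import numpy as np


def fit_bin_edges(values, n_bins):
    """Return the n_bins - 1 interior quantile edges of a numeric feature.

    Quantile binning is an assumption of this sketch, not the paper's scheme.
    """
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)


def magnitude_tokens(values, edges):
    """Map each scalar to a bin index in [0, len(edges)].

    Larger values never receive a smaller index, so the token id
    reflects the relative magnitude of the value within the feature.
    """
    return np.searchsorted(edges, values, side="right")


rng = np.random.default_rng(0)
n_bins, dim = 8, 16  # hypothetical sizes

# Fit the bins on training values of one numeric feature.
train_values = rng.normal(size=1000)
edges = fit_bin_edges(train_values, n_bins)

# One embedding vector per magnitude token (randomly initialised here;
# in a real model this table would be learned).
embedding_table = rng.normal(size=(n_bins, dim))

# Discretize new values and look up their high-dimensional token vectors.
new_values = np.array([-3.0, 0.0, 3.0])
tokens = magnitude_tokens(new_values, edges)
vectors = embedding_table[tokens]
```

In this sketch, each numeric value ends up as an index into a shared table of token vectors, which is the kind of discrete input a language model can consume alongside the tokens of the feature name.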
