Tokenize features, enhancing tables: the FT-TABPFN model for tabular classification
Quangao Liu, Wei Yang, Chen Liang, Longlong Pang, Zhuozhang Zou
TL;DR
FT-TabPFN addresses TabPFN's weakness on categorical features by adding a Feature Tokenization layer and orthogonal regularization, which together produce richer, category-aware token representations. The method tokenizes numerical and categorical features into separate feature tokens, aggregates them into a sample embedding, and then fine-tunes on downstream tasks. In experiments on five datasets, FT-TabPFN achieves competitive or superior AUC-ROC (OVO) rankings compared to strong baselines, with the full configuration performing best. The work makes in-context tabular transformers more practical for heterogeneous tabular data.
Abstract
Traditional methods for tabular classification usually rely on supervised learning from scratch, which requires extensive training data to determine model parameters. TabPFN, a Prior-Data Fitted Network for tabular data, has changed this paradigm. It uses a 12-layer transformer trained on large synthetic datasets to learn universal tabular representations, which enables fast and accurate predictions on new tasks with a single forward pass and no additional training. Although TabPFN has been successful on small datasets, it generally performs worse on data with categorical features. To overcome this limitation, we propose FT-TabPFN, an enhanced version of TabPFN that adds a novel Feature Tokenization layer to better handle categorical features. Fine-tuned for downstream tasks, FT-TabPFN not only expands the functionality of the original model but also significantly improves its applicability and accuracy in tabular classification. Our full source code is available for community use and development.
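
To make the idea of feature tokenization concrete, the following NumPy sketch shows one plausible reading of it; it is not the authors' implementation. The class name `FeatureTokenizer`, the function `orthogonal_penalty`, the per-feature affine map for numerical values, the mean aggregation, and the choice to penalize overlap between per-feature vectors are all assumptions made for illustration. The summary above does not specify these details.

```python
import numpy as np


class FeatureTokenizer:
    """Hypothetical sketch: map each feature of a row to its own d-dimensional token."""

    def __init__(self, n_num, cat_cardinalities, d_token, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: each numerical feature j gets token x_j * w_j + b_j.
        self.num_weight = rng.normal(0.0, 0.1, (n_num, d_token))
        self.num_bias = rng.normal(0.0, 0.1, (n_num, d_token))
        # Assumption: each categorical feature has its own embedding table,
        # so categories are looked up rather than treated as ordered numbers.
        self.cat_tables = [rng.normal(0.0, 0.1, (c, d_token)) for c in cat_cardinalities]

    def tokens(self, x_num, x_cat):
        # x_num: (n, n_num) floats; x_cat: (n, n_cat) integer category codes.
        num_tok = x_num[:, :, None] * self.num_weight[None] + self.num_bias[None]
        cat_tok = np.stack(
            [table[x_cat[:, j]] for j, table in enumerate(self.cat_tables)], axis=1
        )
        # Result: (n, n_num + n_cat, d_token), one token per feature.
        return np.concatenate([num_tok, cat_tok], axis=1)

    def embed(self, x_num, x_cat):
        # Assumption: tokens are aggregated into a sample embedding by averaging.
        return self.tokens(x_num, x_cat).mean(axis=1)


def orthogonal_penalty(feature_vectors):
    """Sum of squared off-diagonal cosine similarities between row vectors.

    The value is zero when the rows are mutually orthogonal. Which vectors the
    paper regularizes is not stated in this summary; this is an illustrative choice.
    """
    unit = feature_vectors / np.linalg.norm(feature_vectors, axis=1, keepdims=True)
    gram = unit @ unit.T
    off_diagonal = gram - np.eye(len(unit))
    return float((off_diagonal ** 2).sum())
```

In a fine-tuning setup, a penalty of this kind would typically be added to the classification loss with a small weight, so that the vectors associated with different features stay distinguishable from one another.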
