Tokenize features, enhancing tables: the FT-TABPFN model for tabular classification

Quangao Liu, Wei Yang, Chen Liang, Longlong Pang, Zhuozhang Zou

TL;DR

FT-TabPFN addresses TabPFN's weakness on categorical features by adding a Feature Tokenization Layer and orthogonal regularization to produce richer, category-aware token representations. The method tokenizes numerical and categorical features into separate feature tokens, aggregates them into a sample embedding, and then fine-tunes on downstream tasks. Experiments on five datasets show FT-TabPFN achieving competitive or superior AUC-ROC (OVO) rankings compared to strong baselines, with the full configuration delivering the best performance. This makes in-context tabular transformers more practical for heterogeneous tabular data.
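
The two ingredients named above, feature tokenization and an orthogonality penalty on category tokens, can be illustrated with a small NumPy sketch. It is not the authors' implementation: the class `FeatureTokenizer`, the function `orthogonality_penalty`, the token width of 8, the mean aggregation, and the exact form of the penalty are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D_TOKEN = 8  # hypothetical token width, kept small for illustration


class FeatureTokenizer:
    """Toy tokenizer: one token per feature, then pooled into a sample embedding."""

    def __init__(self, n_numeric, cardinalities, d_token=D_TOKEN):
        # Numerical feature j is mapped to the token x_j * w_j + b_j.
        self.w_num = rng.normal(size=(n_numeric, d_token))
        self.b_num = rng.normal(size=(n_numeric, d_token))
        # Categorical feature k has a lookup table with one row per category.
        self.tables = [rng.normal(size=(c, d_token)) for c in cardinalities]

    def tokens(self, x_num, x_cat):
        # x_num: (n, n_numeric) floats; x_cat: (n, n_categorical) integer codes.
        num_tok = x_num[:, :, None] * self.w_num[None] + self.b_num[None]
        cat_tok = np.stack(
            [table[x_cat[:, k]] for k, table in enumerate(self.tables)], axis=1
        )
        # Result has shape (n, n_numeric + n_categorical, d_token).
        return np.concatenate([num_tok, cat_tok], axis=1)

    def embed(self, x_num, x_cat):
        # Assumed aggregation: average the feature tokens of each sample.
        return self.tokens(x_num, x_cat).mean(axis=1)


def orthogonality_penalty(table):
    """Mean squared off-diagonal inner product of L2-normalised category tokens."""
    unit = table / np.linalg.norm(table, axis=1, keepdims=True)
    gram = unit @ unit.T                      # inner-product matrix of the tokens
    off_diag = gram - np.diag(np.diag(gram))  # keep only cross-category terms
    return float((off_diag ** 2).sum() / max(gram.size - len(gram), 1))


tok = FeatureTokenizer(n_numeric=2, cardinalities=[3, 4])
x_num = np.array([[0.5, -1.0], [2.0, 0.0]])
x_cat = np.array([[0, 3], [2, 1]])
print(tok.tokens(x_num, x_cat).shape)  # (2, 4, 8)
print(tok.embed(x_num, x_cat).shape)   # (2, 8)
```

In this sketch the penalty is zero when the category tokens of a feature are mutually orthogonal and grows as they become more similar, which is the behaviour an orthogonal regularizer is meant to discourage. The sketch assumes at least one categorical feature.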

Abstract

Traditional methods for tabular classification usually rely on supervised learning from scratch, which requires extensive training data to determine model parameters. However, TabPFN, a novel approach based on Prior-Data Fitted Networks, has changed this paradigm. TabPFN uses a 12-layer transformer trained on large synthetic datasets to learn universal tabular representations. This method enables fast and accurate predictions on new tasks with a single forward pass and no need for additional training. Although TabPFN has been successful on small datasets, it generally shows weaker performance when dealing with categorical features. To overcome this limitation, we propose FT-TabPFN, an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle categorical features. By fine-tuning it for downstream tasks, FT-TabPFN not only expands the functionality of the original model but also significantly improves its applicability and accuracy in tabular classification. Our full source code is available for community use and development.

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: A brief visualisation of TabPFN. There are three support samples and two query samples. The x and y of each support sample are embedded together in one vector. The x of each query sample is embedded in a vector on its own. The y of the query samples is what the model predicts.
  • Figure 2: Input Embedding module of TabPFN. The feature vector x of each query sample is expanded from its original dimension to 100 dimensions through zero-padding, and then mapped to a 512-dimensional space via a linear layer linear(100, 512) to form the query sample embedding. The x of each support sample undergoes the same process. The y of each support sample, however, is mapped to the same 512-dimensional space independently through another linear layer linear(1, 512). Finally, the x and y embeddings of each support sample are added together to form the final support sample embedding.
  • Figure 3: Feature Tokenization Layer. This innovative layer processes continuous and categorical features distinctly, transforming them into feature tokens which are then aggregated to construct the final sample embedding.
  • Figure 4: Average training and test logs over all datasets. These graphs show the average training and testing metrics for all datasets, comparing FT-TabPFN$_{n.i.}$, FT-TabPFN$_{n.r.}$, and FT-TabPFN.
  • Figure 5: Category Token Correlation Heatmaps. Each subplot shows the inner-product matrix of all category tokens for the categorical features of one dataset. The heatmaps compare these matrices under different configurations, showing how the similarity between category tokens changes.
  • ...and 1 more figures
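
The input embedding described in the Figure 2 caption can be traced with a short NumPy sketch. Only the sizes 100 and 512 come from the caption. The function names, the randomly initialised weights, and the right-side padding are assumptions, and the sketch omits any normalisation or rescaling the real model may apply.

```python
import numpy as np

MAX_FEATURES = 100  # width after zero-padding, from the Figure 2 caption
D_MODEL = 512       # embedding width, from the Figure 2 caption

rng = np.random.default_rng(0)
# Randomly initialised stand-ins for the two linear layers (hypothetical weights).
W_x = rng.normal(size=(MAX_FEATURES, D_MODEL)) * 0.01
b_x = np.zeros(D_MODEL)
W_y = rng.normal(size=(1, D_MODEL)) * 0.01
b_y = np.zeros(D_MODEL)


def pad_features(x):
    """Zero-pad (n, d) features on the right to (n, MAX_FEATURES)."""
    n, d = x.shape
    if d > MAX_FEATURES:
        raise ValueError(f"at most {MAX_FEATURES} features supported, got {d}")
    return np.pad(x, ((0, 0), (0, MAX_FEATURES - d)))


def embed_query(x):
    """Query embedding: padded features through the (100 -> 512) linear map."""
    return pad_features(x) @ W_x + b_x


def embed_support(x, y):
    """Support embedding: x-embedding plus y-embedding from the (1 -> 512) map."""
    y_col = y.reshape(-1, 1).astype(float)
    return embed_query(x) + (y_col @ W_y + b_y)


# Three support samples and two query samples, as in Figure 1.
x_support = rng.normal(size=(3, 4))
y_support = np.array([0, 1, 0])
x_query = rng.normal(size=(2, 4))
print(embed_support(x_support, y_support).shape)  # (3, 512)
print(embed_query(x_query).shape)                 # (2, 512)
```

Because every feature vector is padded to the same width before the shared linear map, a single set of weights can serve datasets with different numbers of columns, up to the 100-feature limit.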