TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting

Andrei Margeloiu, Adrián Bazaga, Nikola Simidjievski, Pietro Liò, Mateja Jamnik

TL;DR

TabMDA introduces a training-free manifold data augmentation framework for tabular data by leveraging in-context learning in pre-trained tabular transformers. It generates diverse label-invariant embeddings through in-context subsetting (ICS) and trains downstream classifiers on the augmented embedding space, achieving significant accuracy gains and reduced variance across multiple datasets and models. The method demonstrates practical benefits for small-to-medium tabular tasks and enables competitive performance for explainable classifiers like KNN, while highlighting dependencies on the pre-trained encoder's priors. Potential limitations include privacy concerns from requiring access to the full training set and reliance on the quality of the embedding space, with future work focusing on encoder distillation and deeper analysis of the learned manifolds.

Abstract

Tabular data is prevalent in many critical domains, yet it is often challenging to acquire in large quantities. This scarcity usually results in poor performance of machine learning models on such data. Data augmentation, a common strategy for performance improvement in vision and language tasks, typically underperforms for tabular data due to the lack of explicit symmetries in the input space. To overcome this challenge, we introduce TabMDA, a novel method for manifold data augmentation on tabular data. This method utilises a pre-trained in-context model, such as TabPFN, to map the data into an embedding space. TabMDA performs label-invariant transformations by encoding the data multiple times with varied contexts. This process explores the learned embedding space of the underlying in-context models, thereby enlarging the training dataset. TabMDA is a training-free method, making it applicable to any classifier. We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various tabular datasets. Our results demonstrate that TabMDA provides an effective way to leverage information from pre-trained in-context models to enhance the performance of downstream classifiers. Code is available at https://github.com/AdrianBZG/TabMDA.
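
The core idea in the abstract, encoding the same data several times under different contexts and keeping the labels unchanged, can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_context_encoder` is a hypothetical stand-in for a pre-trained in-context encoder such as TabPFN, and the function names, the stratified sampling of contexts, and the parameters `n_contexts` and `context_size` are all assumptions made for illustration.

```python
import numpy as np


def toy_context_encoder(ctx_X, ctx_y, query_X, classes):
    """Hypothetical stand-in for a pre-trained in-context encoder.

    Embeds each query row as its Euclidean distance to every per-class
    centroid of the context, so the embedding changes with the context.
    """
    centroids = np.stack([ctx_X[ctx_y == c].mean(axis=0) for c in classes])
    # Result has shape (n_query, n_classes).
    return np.linalg.norm(query_X[:, None, :] - centroids[None, :, :], axis=2)


def augment_with_context_subsetting(X, y, n_contexts=5, context_size=20, seed=0):
    """Embed the full training set once per randomly sampled context.

    Returns stacked embeddings and labels, n_contexts times the original size.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    per_class = max(1, context_size // len(classes))
    emb_parts, label_parts = [], []
    for _ in range(n_contexts):
        # Assumption: sample the context per class so every class is present.
        idx = np.concatenate([
            rng.choice(
                np.flatnonzero(y == c),
                size=min(per_class, int(np.sum(y == c))),
                replace=False,
            )
            for c in classes
        ])
        emb_parts.append(toy_context_encoder(X[idx], y[idx], X, classes))
        # Label-invariant transformation: labels carry over unchanged.
        label_parts.append(y)
    return np.concatenate(emb_parts), np.concatenate(label_parts)


# Synthetic two-class data, for illustration only.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0, 1, (30, 4)), data_rng.normal(3, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)

Z, y_aug = augment_with_context_subsetting(X, y, n_contexts=5, context_size=20)
print(Z.shape, y_aug.shape)  # (300, 2) (300,)
```

Any downstream classifier can then be fitted on `Z` and `y_aug` instead of `X` and `y`. At test time the inputs would need to be embedded by the same encoder before prediction.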

Paper Structure (12 sections, 1 equation, 8 figures, 3 tables, 1 algorithm)

Figures (8)

  • Figure 1: TabMDA improves tabular data classifiers by jointly embedding and augmenting the dataset in the embedding space of pre-trained tabular transformers using in-context learning. It generates multiple embeddings for each input by presenting different contexts to the encoder, leveraging its in-context learning capability. This results in an expanded training dataset with more diverse samples, enhancing the accuracy and robustness of the downstream predictor. TabMDA is training-free and can be applied to any classifier.
  • Figure 2: PCA projection onto the first two principal components of the raw input data space for the "vehicle" dataset (left), the manifold space using the TabPFN encoder of Hollmann et al. (2023) (middle), and the augmented manifold space after using our proposed method (right). The colour of each data point indicates one of four class labels. Visualisations for all datasets are available in the appendix.
  • Figure 3: Average accuracy (%) for five downstream classifiers trained on real data (the original input space) and on TabMDA embeddings. We report the mean ± std of test balanced accuracy over 10 runs for each predictor, totalling 50 runs. Training with TabMDA substantially improves performance across five real-world tabular datasets and reduces variability among classifiers. It also performs well on artificial datasets such as "fourier", which consist of processed images, even though such data differs from the inductive bias of the in-context encoder, TabPFN.
  • Figure 4: PCA projection onto the first two principal components of the raw input data space for the "protein" dataset (left), the manifold space using the TabPFN encoder of Hollmann et al. (2023) (middle), and the augmented manifold space after using our proposed method (right). The colour of each data point indicates its class label.
  • Figure 5: PCA projection onto the first two principal components of the raw input data space for the "biodeg" dataset (left), the manifold space using the TabPFN encoder of Hollmann et al. (2023) (middle), and the augmented manifold space after using our proposed method (right). The colour of each data point indicates its class label.
  • ...and 3 more figures
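
The two-dimensional PCA views described in the figure captions above can be produced, in outline, by centring the data and projecting it onto its first two principal components. The sketch below uses only NumPy. The helper name `pca_2d` and the random input are assumptions for illustration, and the same function would be applied separately to the raw inputs, the encoder embeddings, and the augmented embeddings.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)  # centre each feature
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


# Random data standing in for a dataset or an embedding matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
P = pca_2d(X)
print(P.shape)  # (100, 2)
```

Each row of `P` can be drawn as a point in a scatter plot and coloured by its class label.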