PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization

Marco Spinaci, Marek Polewczyk, Johannes Hoffart, Markus C. Kohler, Sam Thelin, Tassilo Klein

TL;DR

PORTAL (Pretraining One-Row-at-a-Time for All tabLes) is a framework that handles various data modalities without the need for cleaning or preprocessing, offering a practical advancement in self-supervised learning for large-scale tabular data.

Abstract

Self-supervised learning on tabular data seeks to apply advances from the natural language and image domains to the diverse domain of tables. However, current techniques often struggle to integrate multi-domain data, and they either require data cleaning or impose specific structural requirements, which limits the scalability of pre-training datasets. We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing. This simple yet powerful approach can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks. This work offers a practical advancement in self-supervised learning for large-scale tabular data.

Paper Structure

This paper contains 13 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Schematic illustration of PORTAL architecture: Example on a row with 4 columns. Blue cells denote custom encodings/decodings for date (day, month, year, day of the week, and holidays). Green cells correspond to numbers (sign, fraction, and exponent), and gray cells correspond to text embeddings. Dark gray cells are column name embeddings. All values are processed via a trainable linear/embedding layer before being aggregated by sum. In the output layer, similar decoding layers are applied before feeding the outputs to cross-entropy or Huber loss in training (see the section on decoding for details).
  • Figure 2: Epoch-wise pre-training performance: Validation metrics per epoch for the number and text heads during pre-training.
  • Figure 3: Performance analysis by model size and count: Top: Effect of bagging on $R^2$ score and accuracy. This experiment was conducted by selecting the top $n$ performing models (based on validation $R^2$ scores and accuracy) from a single batch of 10 runs. Bottom: Performance of models of different sizes, trained on the full training datasets (using patience = 10 epochs and, for regression, predicting $\tilde{\alpha}$ with binary cross-entropy loss).
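
The content-specific encodings named in the Figure 1 caption can be illustrated with a minimal Python sketch. It only shows how a number might be split into sign, fraction, and exponent, and a date into calendar fields, before any trainable layer is applied. The function names, the base-2 decomposition via `math.frexp`, and the dictionary layout are assumptions made for illustration; the caption does not specify the base or the exact feature layout, and the holiday feature is omitted because it would need an external calendar.

```python
import math
from datetime import date


def encode_number(x: float) -> dict:
    """Split a finite float into sign, fraction, and exponent.

    Assumption: base-2 decomposition via math.frexp, so that
    abs(x) == fraction * 2**exponent with 0.5 <= fraction < 1 (or 0 for x == 0).
    """
    if not math.isfinite(x):
        raise ValueError("this sketch only handles finite numbers")
    fraction, exponent = math.frexp(abs(x))
    sign = 1 if x < 0 else 0
    return {"sign": sign, "fraction": fraction, "exponent": exponent}


def decode_number(parts: dict) -> float:
    """Invert encode_number: rebuild the float from its three parts."""
    value = math.ldexp(parts["fraction"], parts["exponent"])
    return -value if parts["sign"] else value


def encode_date(d: date) -> dict:
    """Split a date into the calendar fields listed in the caption.

    Holidays are left out here (hypothetical extension: look them up
    in a calendar table and add a boolean field).
    """
    return {
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "weekday": d.weekday(),  # Monday == 0, Sunday == 6
    }


if __name__ == "__main__":
    parts = encode_number(-6.5)
    print(parts, decode_number(parts))
    print(encode_date(date(2024, 1, 1)))
```

In a model along the lines of Figure 1, each of these fields would then pass through its own trainable linear or embedding layer, and the results would be summed with the column name embedding; that part is not shown here.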