TabularFM: An Open Framework For Tabular Foundational Models

Quan M. Tran, Suong N. Hoang, Lam M. Nguyen, Dzung Phan, Hoang Thanh Lam

TL;DR

This work addresses the under-explored area of tabular foundational models by introducing TabularFM, an open framework that pretrains FMs on curated tabular datasets and benchmarks their transferability. It combines state-of-the-art generative approaches (CTGAN, TVAE, STVAE, STVAEM, and GReaT), trains them on large-scale Kaggle and GitTables data, and provides pretrained models and leaderboards. The findings show that pretrained tabular FMs transfer across domains: pretrained CTGAN and TVAE consistently outperform training from scratch, and text-pretrained transformers perform strongly under certain conditions. Overall, TabularFM advances reproducible evaluation and rapid benchmarking for tabular foundational models, paving the way for broader data-type coverage and larger-scale pretraining.

Abstract

Foundational models (FMs), pretrained on extensive datasets using self-supervised techniques, are capable of learning generalized patterns from large amounts of data. This reduces the need for extensive labeled datasets for each new task, saving both time and resources by leveraging the broad knowledge base established during pretraining. Most research on FMs has focused on unstructured data, such as text and images, or semi-structured data, such as time series. Structured data, such as tabular data, has received limited attention: despite its prevalence, it remains under-studied due to a lack of clean datasets and insufficient research on the transferability of FMs across tabular data tasks. In response to this gap, we introduce a framework called TabularFM, which incorporates state-of-the-art methods for developing FMs specifically for tabular data. These include variations of neural architectures such as GANs, VAEs, and Transformers. We have curated one million tabular datasets and released cleaned versions to facilitate the development of tabular FMs. We pretrained FMs on this curated data, benchmarked various learning methods on these datasets, and released the pretrained models along with leaderboards for future comparative studies. Our fully open-sourced system provides a comprehensive analysis of the transferability of tabular FMs. By releasing these datasets, pretrained models, and leaderboards, we aim to enhance the validity and usability of tabular FMs in the near future.

Paper Structure

This paper contains 48 sections, 31 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: t-SNE representation of the top 10 domains, clustered by the $k$-means algorithm. Domain names were labeled manually based on the cluster keywords.
  • Figure 2: Word clouds of columns with high transferability (a) and low transferability (b).
  • Figure 3: Column pairs with the most significant differences in pair trends between the pretrained models and the model trained from scratch (a), and vice versa (b).
  • Figure 4: Column shape comparison between the predictions of pretrained STVAE and STVAE trained from scratch.
  • Figure 5: Training and validation loss of STVAE models when pretrained versus trained from scratch. Pretrained STVAE converges faster and toward better solutions.
  • ...and 5 more figures