UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science

Yazheng Yang, Yuqi Wang, Guang Liu, Ledell Wu, Qi Liu

TL;DR

The paper tackles universal pretraining for tabular data with diverse schemas by introducing UniTabE, a TabUnit-based architecture that feeds per-cell representations into a Transformer encoder and uses a shallow decoder with free-form prompts for both pretraining and finetuning. It is pretrained on a Kaggle-derived corpus of approximately 13B samples, using multi-cell masking and contrastive learning as self-supervised objectives, and is evaluated on classification and regression benchmarks, zero-shot settings, and incremental-column scenarios. Empirical results show UniTabE outperforming baselines, including XGBoost, across many tasks, and ablations highlight the importance of the fuse and linking layers as well as the self-supervised objectives. The work demonstrates the feasibility of large-scale tabular pretraining for data science, enabling scalable downstream adaptation and potential synergy with traditional models such as XGBoost.

Abstract

Recent advancements in NLP have witnessed the groundbreaking impact of pretrained models, yielding impressive outcomes across various tasks. This study seeks to extend the power of pretraining methodologies to facilitating the prediction over tables in data science, a domain traditionally overlooked, yet inherently challenging due to the plethora of table schemas intrinsic to different tasks. The primary research questions underpinning this work revolve around the establishment of a universal pretraining protocol for tables with varied structures, the generalizability and transferability of learned knowledge across tasks, the adaptation to diverse downstream applications, and the incorporation of incremental columns over time. In response to these challenges, we introduce UniTabE, a straightforward yet effective method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures. UniTabE's core concept relies on representing each basic table element with a module, termed TabUnit. This is subsequently followed by a Transformer encoder to refine the representation. Moreover, our model is designed to facilitate pretraining and finetuning through the utilization of free-form prompts. In order to implement the pretraining phase, we curated an expansive tabular dataset comprising approximately 13B samples, meticulously gathered from the Kaggle platform. This research primarily centers on classification and regression tasks involving tabular data, and conducts rigorous experimental testing and analyses to validate the effectiveness of our methodology. The experimental results demonstrate UniTabE's superior performance against several baselines across massive benchmarks. This, therefore, underscores UniTabE's potential to significantly enhance the semantic representation of tabular data, thereby marking a significant stride for tabular data analysis.

Paper Structure (22 sections, 9 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: The left part delineates how the TabUnit module processes an individual cell. The right part provides an overview of the UniTabE architecture. "n" denotes the number of cells in each example, "Q" the length of the prompt, and "T" the length of the target. A shallow decoder is applied, offering adaptability to diverse downstream tasks.
  • Figure 2: Demonstration of contrastive learning. (B1, B2) is a positive pair, while (B1, B3) and (B1, B4) are negative pairs.
  • Figure 3: Distribution visualization. The left part (a) demonstrates the distribution of domains and the number of tables in each domain. Please magnify the figure as some captions are small. The right part shows the proportion (cell level) of different data types in train/dev/test splits.
  • Figure 4: BLEU scores illustrating the generation of textual values across various model sizes and dataset sizes.
  • Figure 5: Demonstration of multi-cell-masking.
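
The two self-supervised objectives named in Figures 2 and 5 can be illustrated with a small sketch. This is not the paper's implementation: the excerpt here does not give the loss definitions, so the code assumes a generic InfoNCE-style contrastive loss over cosine similarities and a simple "replace randomly chosen cell values with a placeholder" masking scheme. The names `MASK`, `mask_cells`, `info_nce`, and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_cells(row, num_masked, rng):
    """Mask the values of `num_masked` randomly chosen cells in one row.

    row: list of (column_name, value) pairs, one pair per cell.
    Returns (masked_row, targets), where targets maps the index of each
    masked cell to its original value (what a model would have to predict).
    """
    if not 0 <= num_masked <= len(row):
        raise ValueError("num_masked must be between 0 and the number of cells")
    chosen = {int(i) for i in rng.choice(len(row), size=num_masked, replace=False)}
    masked_row = [
        (name, MASK if i in chosen else value)
        for i, (name, value) in enumerate(row)
    ]
    targets = {i: row[i][1] for i in sorted(chosen)}
    return masked_row, targets


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (assumed form, not the paper's).

    The loss is the negative log-probability of picking the positive
    among {positive} + negatives, scored by cosine similarity.
    """
    def unit(v):
        v = np.asarray(v, dtype=float)
        return v / np.linalg.norm(v)

    a = unit(anchor)
    candidates = np.stack([unit(positive)] + [unit(n) for n in negatives])
    logits = candidates @ a / temperature
    logits = logits - logits.max()  # shift for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    row = [("age", 31), ("city", "Oslo"), ("income", 52000.0), ("owner", True)]
    masked_row, targets = mask_cells(row, num_masked=2, rng=rng)
    print(masked_row, targets)

    # Mirrors Figure 2: (B1, B2) is the positive pair, B3 and B4 are negatives.
    b1, b2 = [1.0, 0.0], [0.9, 0.1]
    b3, b4 = [0.0, 1.0], [-1.0, 0.0]
    print(info_nce(b1, b2, [b3, b4]))
```

In this sketch the column names stay visible while the values are hidden, so a model would have to recover each masked value from the remaining cells; whether the paper masks cells in exactly this way is an assumption.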