CARTE: Pretraining and Transfer for Tabular Learning

Myung Jun Kim; Léo Grinsztajn; Gaël Varoquaux

TL;DR

CARTE introduces a graph-based framework for tabular learning that does not require explicit schema or entity matching to transfer knowledge across tables. It pretrains on a large knowledge base by constructing graphlets from a knowledge graph and employs a graph-attentional transformer with a contrastive loss, then fine-tunes efficiently on downstream tasks. Across 51 diverse datasets, CARTE outperforms a wide set of baselines on single-table learning and demonstrates robust cross-table transfer, even when columns or entries are unmatched. This work opens the door to tabular foundation models by enabling open-vocabulary, context-aware representations that scale across heterogeneous tabular data sources. It also highlights trade-offs in computation and emphasizes the importance of string-level representations in tabular data learning.

Abstract

Pretrained deep-learning models are the go-to solution for images or text. However, for tabular data the standard is still to train tree-based models. Indeed, transfer learning on tables hits the challenge of data integration: finding correspondences in the entries (entity matching), where different words may denote the same entity, and correspondences across columns (schema matching), which may come in different orders or under different names. We propose a neural architecture that does not need such correspondences. As a result, we can pretrain it on background data that has not been matched. The architecture, CARTE (Context Aware Representation of Table Entries), uses a graph representation of tabular (or relational) data to process tables with different columns, string embeddings of entries and column names to model an open vocabulary, and a graph-attentional network to contextualize entries with column names and neighboring entries. An extensive benchmark shows that CARTE facilitates learning, outperforming a solid set of baselines including the best tree-based models. CARTE also enables joint learning across tables with unmatched columns, enhancing a small table with bigger ones. CARTE opens the door to large pretrained models for tabular data.
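
As a rough illustration of the graph representation mentioned above, the sketch below turns one table row into a star-like graphlet, following the description in the Figure 1 caption: one leaf per non-missing cell, edges carrying the column name, numerical leaves scaled by the column feature, and a center node set to the average of the leaves. It is not the authors' implementation. The names `embed`, `row_to_graphlet`, and `DIM` are hypothetical, and `embed` is a hash-seeded random vector standing in for a real language model.

```python
import hashlib

import numpy as np

DIM = 8  # toy embedding size (assumption); a language model would use far more


def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a language model: a deterministic unit vector per string."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(DIM)
    return vec / np.linalg.norm(vec)


def is_missing(value) -> bool:
    """Treat None and NaN as missing cells."""
    return value is None or (isinstance(value, float) and np.isnan(value))


def row_to_graphlet(row: dict) -> dict:
    """Build a star graph from one row: one leaf and one edge per non-missing cell."""
    columns, edge_feats, leaf_feats = [], [], []
    for column, value in row.items():
        if is_missing(value):
            continue  # missing cells produce no leaf
        col_vec = embed(column)  # edge feature: embedded column name
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            leaf = float(value) * col_vec  # numerical cell: value times column feature
        else:
            leaf = embed(str(value))  # string cell: embedded entry
        columns.append(column)
        edge_feats.append(col_vec)
        leaf_feats.append(leaf)
    if not leaf_feats:
        raise ValueError("row has no non-missing cells")
    leaves = np.stack(leaf_feats)
    return {
        "columns": columns,
        "center": leaves.mean(axis=0),  # center node starts as the mean of the leaves
        "leaves": leaves,
        "edges": np.stack(edge_feats),
    }


# Two rows from tables with different, unmatched columns map to the same kind of object.
g1 = row_to_graphlet({"name": "Chez Panisse", "city": "Berkeley", "rating": 4.5, "phone": None})
g2 = row_to_graphlet({"title": "Alien", "year": 1979})
```

Because every column name is embedded as a string rather than looked up in a fixed schema, rows from tables with different columns yield graphlets with the same feature dimension, which is what allows a single network to consume them without schema matching.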

Paper Structure (55 sections, 3 equations, 12 figures, 6 tables)

Introduction
Related Works
Tabular deep learning
Transfer learning for tabular data
Pretrained models for tabular data
Discrete entries
Data integration
The CARTE Model to Learn Across Tables
Graph Representation of Table Entities
Pretrained Model from a Large Knowledge Base
Graphlets for pretraining
Batch samples
Model architecture
Contrastive loss
Fine-tuning for Downstream Tasks
...and 40 more sections

Figures (12)

Figure 1: Graphlet representation of tabular entities. From a table, CARTE represents each row as a star-like graph. Except for missing values, the leaf nodes and the edges are annotated with the cell values and their corresponding column names. CARTE then initializes the features of each node and edge with a language model. A node holding a numerical value is initialized as the elementwise product of the value with its corresponding column feature. The center node is initially set to the average of the leaf nodes. It later acts as a readout that captures the overall information of the graphlet.
Figure 2: CARTE pretraining process. From a large knowledge graph, CARTE begins by constructing graphlets and their positive variants. The extracted samples are then fed into the CARTE neural network and trained with a self-supervised scheme. The neural network learns to aggregate information within the graphlets, which reflect the combination of table entries across columns (edges).
Figure 3: CARTE architecture. The inputs of CARTE are graphs that contain node ($X$) and edge ($E$) features, both used in self-attention layers (shown in grey). The attention layers update node features using the context embodied in the edge information; the graph structure of the input is reflected by attention masks. The Aggregate & Readout layer consists of an attention layer (without the edge update) followed by feature extraction on the center node. The outputs are then processed for the contrastive loss.
Figure 4: CARTE performs best for learning on single tables. Learning curves on (a) regression and (b) classification tasks. Top: normalized score (1 is the best performer across all methods and train sizes for a dataset, and 0 the worst), averaged across datasets. Bottom: critical difference diagrams (Terpilowski, 2019) for all train sizes. Figure \ref{fig:comparison_all} gives the critical difference diagram for all methods studied.
Figure 5: Entity matching is not required for CARTE, and downstream entities do not need to be in YAGO. We evaluate CARTE and KEN either on the full datasets, or on a reduced version of the datasets corresponding to entities present in YAGO. In addition, when entities are present in YAGO, we either match them to their canonical names in YAGO (blue) or keep the original names (orange). When KEN is used to enrich the dataset, CatBoost is used as the estimator, and entities without a match are replaced with missing values. Each point on the figure corresponds to an improvement in performance with respect to CatBoost without any enrichment. That KEN brings performance gains to CatBoost on YAGO entities confirms the added value of background information. Appendix \ref{app:matching} gives detailed results.
...and 7 more figures
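
The normalized score in the Figure 4 caption (1 for the best performer across all methods and train sizes on a dataset, 0 for the worst) reads most naturally as a per-dataset min-max rescaling. The sketch below works under that assumption; the paper may handle ties or outliers differently. The function name `normalize_scores` and the raw numbers are hypothetical and only illustrate the arithmetic.

```python
import numpy as np


def normalize_scores(scores: np.ndarray) -> np.ndarray:
    """Min-max rescale one dataset's scores over all methods and train sizes.

    scores has shape (n_methods, n_train_sizes); higher raw scores are better.
    """
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # Assumption: when every entry ties, treat all of them as best.
        return np.ones_like(scores, dtype=float)
    return (scores - lo) / (hi - lo)


# Hypothetical raw scores: rows are methods, columns are train sizes.
dataset_a = np.array([[0.50, 0.70], [0.60, 0.90]])
dataset_b = np.array([[0.20, 0.40], [0.30, 0.60]])

norm_a = normalize_scores(dataset_a)
norm_b = normalize_scores(dataset_b)

# Average across datasets, giving one curve per method over train sizes.
mean_normalized = np.stack([norm_a, norm_b]).mean(axis=0)
```

Rescaling per dataset before averaging keeps datasets with large raw score ranges from dominating the aggregated learning curves.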
