ConTextTab: A Semantics-Aware Tabular In-Context Learner

Marco Spinaci, Marek Polewczyk, Maximilian Schambach, Sam Thelin

TL;DR

ConTextTab introduces a semantics-aware, table-native in-context learner trained on real-world tabular data to unify semantic understanding with structural efficiency. By employing modality-specific embeddings (text, date, numeric, and column headers) and an interleaved cross-row/cross-column Transformer backbone, it achieves competitive performance across standard tabular benchmarks and sets a new standard on the CARTE benchmark in low-data and semantically rich settings. The model demonstrates the importance of semantic representations for both features and column names and shows that context size and training data diversity are key levers for performance and scalability. While outperforming existing table-native ICL approaches, ConTextTab also highlights ongoing challenges in scaling to very large datasets and longer contexts, underscoring the need for broader real-world data and further architectural innovations.
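As a rough illustration of what an "interleaved cross-row/cross-column" backbone can look like, the sketch below alternates attention over the cells of each row with attention over the cells of each column. It is a minimal toy under stated assumptions, not the authors' implementation: the function names (`attend`, `cross_column`, `cross_row`, `interleaved_backbone`) are hypothetical, attention uses identity projections instead of learned weights, and residual connections, normalization, feed-forward layers, and any masking between context and query rows are omitted. The order of the two attention steps within a layer is also an assumption.

```python
import math
from typing import List

Vector = List[float]
Table = List[List[Vector]]  # table[row][col] is the embedding of one cell


def attend(tokens: List[Vector]) -> List[Vector]:
    """Single-head self-attention with identity projections (no learned weights)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product logits of this query against every token.
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        # Output is the softmax-weighted average of the token vectors.
        out.append([sum(w / z * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def cross_column(table: Table) -> Table:
    """Within each row, let the cells of different columns attend to each other."""
    return [attend(row) for row in table]


def cross_row(table: Table) -> Table:
    """Within each column, let the cells of different rows attend to each other."""
    n_rows, n_cols = len(table), len(table[0])
    cols = [attend([table[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    # Transpose back to table[row][col] layout.
    return [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]


def interleaved_backbone(table: Table, n_layers: int = 2) -> Table:
    """Alternate the two attention directions for n_layers rounds."""
    for _ in range(n_layers):
        table = cross_column(table)
        table = cross_row(table)
    return table


# Toy input: 2 rows x 3 columns, each cell embedded as a 2-dimensional vector.
toy = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]],
]
mixed = interleaved_backbone(toy, n_layers=2)
```

Each attention call only ever sees one row or one column, so the cost per layer grows with rows × columns² plus columns × rows² rather than with the square of the total number of cells, which is the usual motivation for factorizing attention along the two table axes.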

Abstract

Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, it has been extended to larger datasets by recent advances such as TabPFN and TabICL. Although current table-native ICL architectures are efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models, such as TabuLa-8B, integrate deep semantic understanding and world knowledge but can only use a small amount of context due to inherent architectural limitations. To combine the best of both worlds, we introduce ConTextTab, which integrates semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/sap-rpt-1-oss.

Paper Structure

This paper contains 22 sections, 2 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Our proposed model architecture illustrating the integration of data type-specific embeddings, an interleaved attention backbone, and customized output heads.
  • Figure 2: Left: critical difference diagram between ConTextTab and several baselines, across the CARTE benchmark. Right: Impact of pretraining dataset size on validation accuracy and $R^2$ scores.
  • Figure 3: Average rank, accuracy, and regression results on the CARTE benchmark across various data subsets, ranging from 128 rows to the full size.
  • Figure 4: Relation between number of training dataset rows and performance, obtained as a LOWESS regression in the plane $(\log n_{\mathrm{rows}}, \mathrm{rank})$. The confidence bands are the 80% confidence intervals obtained via bootstrapping.
  • Figure 5: Win ratio confusion matrix and average of the investigated models across all 203 datasets. Wins are calculated based on accuracy on classification and $R^2$ on regression datasets. Ties are not counted as wins. Models are sorted by descending overall rank.
  • ...and 8 more figures
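
The win-counting rule stated in the Figure 5 caption (higher accuracy or $R^2$ wins, ties are not counted as wins) can be sketched as follows. This is an illustrative sketch only: the function names, the model names, and the scores are hypothetical, and two details are assumptions not stated in the caption, namely that the ratio is taken over all datasets (including tied ones) and that the "average" is the mean win ratio over all opponents.

```python
from typing import Dict, List


def win_ratio_matrix(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """scores[model][i] is the metric on dataset i (accuracy or R^2, higher is better).

    Returns matrix[a][b] = fraction of datasets on which model a strictly beats model b.
    """
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    matrix: Dict[str, Dict[str, float]] = {}
    for a in models:
        matrix[a] = {}
        for b in models:
            if a == b:
                continue
            # Strict inequality: a tie counts as a win for neither model.
            wins = sum(1 for i in range(n_datasets) if scores[a][i] > scores[b][i])
            # Assumption: the denominator is the total number of datasets.
            matrix[a][b] = wins / n_datasets
    return matrix


def average_win_ratio(matrix: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Assumption: the per-model average is the mean win ratio over all opponents."""
    return {a: sum(row.values()) / len(row) for a, row in matrix.items()}


# Hypothetical scores on four datasets (not results from the paper).
scores = {
    "model_a": [0.90, 0.80, 0.70, 0.60],
    "model_b": [0.85, 0.80, 0.75, 0.50],
    "model_c": [0.80, 0.70, 0.60, 0.40],
}
matrix = win_ratio_matrix(scores)
averages = average_win_ratio(matrix)
print(matrix["model_a"]["model_b"])  # 0.5: two wins, one tie, one loss out of four
print(matrix["model_b"]["model_a"])  # 0.25: the tie is not counted for either side
print(averages["model_a"])           # 0.75
```

Because ties count for neither side, `matrix[a][b] + matrix[b][a]` can be less than 1, as it is for `model_a` and `model_b` here.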