TabICLv2: A better, faster, scalable, and open tabular foundation model

Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan

TL;DR

TabICLv2, a new state-of-the-art foundation model for regression and classification built on three pillars, generalizes effectively to million-scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN-2.5.

Abstract

Tabular foundation models, such as TabPFNv2 and TabICL, have recently dethroned gradient-boosted trees at the top of predictive benchmarks, demonstrating the value of in-context learning for tabular data. We introduce TabICLv2, a new state-of-the-art foundation model for regression and classification built on three pillars: (1) a novel synthetic data generation engine designed for high pretraining diversity; (2) various architectural innovations, including a new scalable softmax in attention improving generalization to larger datasets without prohibitive long-sequence pretraining; and (3) optimized pretraining protocols, notably replacing AdamW with the Muon optimizer. On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses the performance of the current state of the art, RealTabPFN-2.5 (hyperparameter-tuned, ensembled, and fine-tuned on real data). With only moderate pretraining compute, TabICLv2 generalizes effectively to million-scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN-2.5. We provide extensive ablation studies to quantify these contributions and commit to open research by first releasing inference code and model weights at https://github.com/soda-inria/tabicl, with synthetic data engine and pretraining code to follow.
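
The scalable softmax mentioned in the abstract can be illustrated with a minimal sketch. It assumes the basic scalable-softmax form from prior work, in which attention logits are multiplied by $s \log n$ (with $n$ the number of keys and $s$ a scale, fixed here to 1 for illustration). It is not the paper's query-aware QASSMax, whose exact form is not given in this summary. The function names and the toy logits are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ssmax(z, s=1.0, axis=-1):
    """Basic scalable softmax (assumed form): softmax(s * log(n) * z).

    n is the number of keys along `axis`; s is a scale (assumed fixed here).
    """
    n = z.shape[axis]
    return softmax(s * np.log(n) * z, axis=axis)


def anchor_weight(n_distractors, attn_fn):
    """Attention weight on one relevant key among many irrelevant ones.

    Toy setup (hypothetical): the relevant key has logit 2.0,
    every distractor has logit 0.0.
    """
    logits = np.zeros(n_distractors + 1)
    logits[0] = 2.0
    return attn_fn(logits)[0]


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        plain = anchor_weight(n, softmax)
        scaled = anchor_weight(n, ssmax)
        print(f"n={n:>7}  softmax={plain:.5f}  ssmax={scaled:.5f}")
```

In this toy setting the plain-softmax weight on the relevant key shrinks toward zero as distractors are added, while the rescaled version keeps it close to one. This is the fading behavior that a scalable softmax is meant to counteract.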

Paper Structure (150 sections, 3 theorems, 92 equations, 69 figures, 1 table)

Key Result

Lemma 1.1

For $k \geq 0$, $m \geq 1$, $i \in \{0, \ldots, m-1\}$, define the set $I_{i,k,m}$ (definition omitted in this summary). Then, if $m \geq 2^k$, we have $|I_{i,k,m} \cap I_{j,k,m}| \leq 1$ for all $0 \leq i < j \leq m-1$.

Figures (69)

  • Figure 1: Improvability vs. train time on TabArena. Improvability (lower is better) measures the relative error gap to the best method, averaged across datasets. Train time is training + inference in 8-fold cross-validation. For foundation models, it is dominated by forward passes that perform in-context learning. Default uses default hyperparameters; Tuned selects the best of 200 random hyperparameter configurations on validation; Tuned + Ens. applies post-hoc weighted ensemble of all configurations. The runtime of TabICLv2 is measured on an H100 GPU, while others are from TabArena. Results for inapplicable model-dataset pairs are imputed with default RandomForest.
  • Figure 2: Architecture of TabICLv2. Given an input $X \in \mathbb{R}^{n \times m}$, repeated feature grouping encodes columns into multiple groups via circular shifts to break feature symmetries, and target-aware embedding injects target information from the beginning. $\text{TF}_{\text{col}}$ embeds each feature through a set transformer, $\text{TF}_{\text{row}}$ aggregates features into row representations $h$, and $\text{TF}_{\text{icl}}$ performs in-context learning to predict test targets $\hat{y}$. QASSMax, our query-aware scalable softmax, is applied both in the part of $\text{TF}_{\text{col}}$ where inducing points aggregate input information and in $\text{TF}_{\text{icl}}$, to mitigate attention fading and improve long-context generalization.
  • Figure 3: SSMax variants mitigate attention fading in a synthetic 2D classification task. We create a dataset consisting of four negative clusters (C1--C4) and one anchor cluster containing a single anchor sample (triangle) in the training set. We increase negative samples while evaluating 20 fixed test samples (red squares) nearest to the anchor. (a) Attention entropy is divided by $\log n$ to ensure values in $(0,1)$ and averaged across all heads and layers in $\text{TF}_{\text{icl}}$, measuring how uniformly test samples attend to training ones. Without SSMax, accuracy drops and entropy rises as negative samples increase, which is a hallmark of attention fading where the model fails to focus on the relevant anchor. QASSMax maintains 100% accuracy with consistently low entropy. (b) shows decision boundaries at 1K and 15K negative samples. The region of the anchor cluster shrinks for all variants as negative samples increase. No SSMax collapses at 15K, while QASSMax preserves a stable boundary containing all test samples.
  • Figure 4: High-level structure of the synthetic dataset generation prior. Random vectors (one per sample) are propagated through a randomly generated graph where each node computes a random function of its parents. Columns of the final dataset are extracted from randomly assigned nodes. The resulting datasets can be rejected based on different filtering criteria. (d) List of the 8 random functions applied: (MLP) Multilayer perceptrons; (Tree Ensemble) Ensembles of symmetric trees inspired by CatBoost; (Discretize) Discretization to nearest neighbors among a random set; (GP) Multivariate Gaussian process functions; (Linear) Linear functions; (Quadratic) Multivariate quadratic functions; (EM) Functions with plateaus inspired by the cluster assignment in the EM algorithm; (Product) Products of other random functions. (e) Examples of generated 2D classification datasets (cf. \ref{fig:prior_datasets}).
  • Figure 5: Improvability vs. inference time on TALENT. The runtime of TabICLv2 is measured on an H100 GPU, while other runtimes are taken from TALENT.
  • ...and 64 more figures
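
The normalized attention entropy described in the Figure 3 caption (entropy divided by $\log n$, then averaged over heads and layers) can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the tensor layout `(layers, heads, n_test, n_train)`, the function name, and the clipping constant `eps` are assumptions.

```python
import numpy as np


def normalized_attention_entropy(attn, eps=1e-12):
    """Mean of H(row) / log(n) over all leading axes.

    attn: array of shape (..., n_test, n_train) whose last axis sums to 1
          (assumed layout: layers, heads, test samples, training samples).
    Returns a value near 0 when each test sample focuses on one training
    sample and near 1 when attention is spread uniformly.
    """
    n = attn.shape[-1]
    if n < 2:
        raise ValueError("need at least two training samples")
    # Clip only inside the log so that zero weights contribute exactly 0.
    log_p = np.log(np.clip(attn, eps, 1.0))
    entropy = -(attn * log_p).sum(axis=-1)
    return float((entropy / np.log(n)).mean())


if __name__ == "__main__":
    shape = (2, 4, 3, 100)  # layers, heads, test rows, training rows
    uniform = np.full(shape, 1.0 / shape[-1])
    peaked = np.zeros(shape)
    peaked[..., 0] = 1.0  # every test row attends to a single training row
    print(normalized_attention_entropy(uniform))
    print(normalized_attention_entropy(peaked))
```

Under this reading, rising values as negative samples are added indicate that test samples attend more uniformly to the training set, which the caption associates with attention fading.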

Theorems & Definitions (6)

  • Lemma 1.1: Intersections of feature groups
  • proof
  • Theorem 7.1: Smoothness of GP sample paths
  • proof
  • Lemma 7.2: Fourier characterization of Sobolev kernels
  • proof