End-to-End Compression for Tabular Foundation Models

Guri Zabërgja, Rafiq Kamel, Arlind Kadra, Christian M. M. Frey, Josif Grabocka

TL;DR

This work tackles the scalability bottleneck of tabular foundation models whose inference cost grows with the size of the training context. It introduces TACO, an end-to-end compressor-predictor architecture that learns a compact latent representation of the training data and feeds it to a standard transformer-based predictor for in-context learning. Through joint training and a chunking strategy, TACO enables efficient inference on large tables with minimal accuracy loss, achieving substantial speedups and memory savings on the TabArena benchmark. The approach outperforms compression baselines and supports scaling to datasets with millions of rows, offering a practical path to deploying tabular foundation models in real-world settings.

Abstract

The long-standing dominance of gradient-boosted decision trees for tabular data has recently been challenged by in-context learning tabular foundation models. In-context learning methods fit and predict in one forward pass without parameter updates, using the training data as context when predicting on query test points. While recent tabular foundation models achieve state-of-the-art performance, their attention-based transformer architecture has quadratic complexity in the dataset size, which increases training and inference time and limits the models' capacity to handle large-scale datasets. In this work, we propose TACO, an end-to-end tabular compression model that compresses the training dataset in a latent space. We test our method on the TabArena benchmark, where it is up to 94x faster at inference while consuming up to 97% less memory than the state-of-the-art tabular transformer architecture, without significant performance degradation. Lastly, our method not only scales better with increasing dataset size, but also achieves better performance than other baselines.
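
To make the quadratic-cost argument concrete, the following is a rough back-of-the-envelope sketch, not the paper's cost model. It counts attention scores for a single in-context forward pass and shows how that count shrinks when the context is replaced by fewer latent rows. The function names, the `keep_fraction` parameter, and the example sizes are hypothetical, and the sketch deliberately ignores the cost of running the compressor itself.

```python
def attention_pairs(context_rows: int, query_rows: int) -> int:
    """Count pairwise attention scores for one in-context forward pass.

    Assumption: context rows attend to each other (quadratic term) and
    every query row attends to every context row (linear term).
    """
    return context_rows * context_rows + query_rows * context_rows


def compressed_context_rows(context_rows: int, keep_fraction: float) -> int:
    """Hypothetical helper: keep_fraction=0.1 keeps 10% of the rows as latent rows."""
    return max(1, round(context_rows * keep_fraction))


if __name__ == "__main__":
    # Illustrative sizes only; these are not taken from the paper.
    n_context, n_query = 100_000, 1_000
    full = attention_pairs(n_context, n_query)
    for keep_fraction in (0.5, 0.1, 0.01):
        latent = compressed_context_rows(n_context, keep_fraction)
        small = attention_pairs(latent, n_query)
        print(
            f"keep_fraction={keep_fraction}: {latent} latent rows, "
            f"{full / small:.0f}x fewer attention scores"
        )
```

Because the context-to-context term is quadratic, halving the context removes roughly three quarters of that term, which is why shrinking the context can pay off even after adding a separate compression step.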

Paper Structure

This paper contains 21 sections, 3 equations, 16 figures, and 7 tables.

Figures (16)

  • Figure 1: The architecture for training a joint end-to-end compressor and a predictor, with each model having its own parameters.
  • Figure 2: Fit + Predict time heatmap. Top: Predict times for TACO and POT, without using KV caching. Bottom: Predict times with KV caching. Black shaded regions indicate out-of-memory errors.
  • Figure 3: Cumulative prediction time for TACO at different compression rates $r$ compared to POT for 100 test batches. Top: Without KV caching. Bottom: With KV caching.
  • Figure 4: Critical difference (CD) diagram comparing predictor-only transformer (POT) and TACO. Average ranks are computed using Friedman–Nemenyi tests [demsar2006statistical] via autorank [herbold2020autorank] ($\alpha = 0.05$). Bars connect methods without significant differences.
  • Figure 5: Distribution of test performances on the TabArena classification tasks. Left: joint training of the compressor and predictor. Right: training only the compressor while keeping the predictor frozen.
  • ...and 11 more figures