Why Tabular Foundation Models Should Be a Research Priority

Boris van Breugel, Mihaela van der Schaar

TL;DR

The paper argues that tabular data, despite its ubiquity in science and industry, remains underserved by foundation-model research. It defines Large Tabular Models (LTMs) and outlines design desiderata, data needs, and benchmarking approaches to enable scalable, adaptable tabular foundation models. It surveys current LTMs, analyzes the challenges of tabular generation with LLMs, and highlights real-world impacts in responsible AI, science, and data democratization. The authors advocate shifting resources toward LTMs, emphasizing practical pathways for development, evaluation, and responsible deployment, with potential to transform how tabular data is processed and reused across disciplines.

Abstract

Recent text and image foundation models are incredibly impressive, and these models are attracting an ever-increasing portion of research resources. In this position piece we aim to shift the ML research community's priorities ever so slightly to a different modality: tabular data. Tabular data is the dominant modality in many fields, yet it is given hardly any research attention and significantly lags behind in terms of scale and power. We believe the time is now to start developing tabular foundation models, or what we coin a Large Tabular Model (LTM). LTMs could revolutionise the way science and ML use tabular data: not as single datasets that are analyzed in a vacuum, but contextualized with respect to related datasets. The potential impact is far-reaching: from few-shot tabular models to automating data science; from out-of-distribution synthetic data to empowering multidisciplinary scientific discovery. We intend to excite reflections on the modalities we study, and convince some researchers to study large tabular models.

Paper Structure

This paper contains 17 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Representation of different modalities in foundation model research across recent ML conferences, roughly estimated as the number of accepted papers whose abstracts contain modality keywords (see the paper's appendix for details). LLMs are booming and tabular data is heavily underrepresented.
  • Figure 2: Sampling continuous distributions autoregressively with LLMs is inefficient. Assume we autoregressively sample tokens with the aim of generating numbers that follow a standard Gaussian. What token probabilities should the LLM output at each sampling step? Consider a total vocabulary of just 102 tokens, $["-", ".", "00",...,"99"]$. For different histories of generated text (one example per row, with the already generated digits on the left), the output probabilities need to be very different. For example, a single "draw" is generated by sampling from the probabilities in the first row (e.g. giving "0"), then the second row (conditional on "0", giving "."), then the fourth row ("0." $\rightarrow$ e.g. "00"), and finally the last row ("0.00" $\rightarrow$ e.g. "00").
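
The point of Figure 2 can be made concrete with a minimal sketch. It is not the paper's code. It assumes, purely for illustration, a simplified number format in which the magnitude is written as one two-digit integer token, then ".", then two-digit decimal tokens, and it ignores the sign token (which would simply have probability 0.5 for a standard Gaussian). The helper names `normal_cdf` and `next_token_probs` are hypothetical. Under these assumptions, the probability the model must assign to each two-digit token is the Gaussian mass of the sub-interval that token selects, renormalized over the interval implied by the tokens generated so far.

```python
import math

# Assumed, simplified format (not the paper's exact setup): the magnitude is
# written as one two-digit integer token, ".", then two-digit decimal tokens,
# e.g. "00" "." "12" "34" for 0.1234. The sign token is ignored here.
TOKENS = [f"{i:02d}" for i in range(100)]


def normal_cdf(x: float) -> float:
    """CDF of the standard Gaussian, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def next_token_probs(lo: float, width: float) -> list[float]:
    """Target probabilities for the next two-digit token.

    The history so far pins the magnitude to [lo, lo + width). Each of the
    100 tokens picks one equal-width sub-interval; its target probability is
    that sub-interval's Gaussian mass, renormalized over the whole interval.
    """
    step = width / len(TOKENS)
    masses = [
        normal_cdf(lo + (i + 1) * step) - normal_cdf(lo + i * step)
        for i in range(len(TOKENS))
    ]
    total = sum(masses)
    if total <= 0.0:
        raise ValueError("interval has no representable Gaussian mass")
    return [m / total for m in masses]


# Step 1: integer part, i.e. which unit interval [i, i + 1) the magnitude is in.
integer_probs = next_token_probs(0.0, 100.0)
# Step 2: first decimal token, conditional on two different histories.
decimals_after_00 = next_token_probs(0.0, 1.0)  # history "00."
decimals_after_02 = next_token_probs(2.0, 1.0)  # history "02."

if __name__ == "__main__":
    print("P(integer token '00'):", round(integer_probs[0], 4))
    for name, probs in [("00.", decimals_after_00), ("02.", decimals_after_02)]:
        print(f"history {name!r}: P('00')/P('99') =", round(probs[0] / probs[-1], 2))
```

In this sketch the decimal-token distribution after "00." is close to uniform, while after "02." it is strongly skewed toward small tokens. The same vocabulary therefore needs a different output distribution for each history, which is the inefficiency the caption describes.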