LaTable: Towards Large Tabular Models

Boris van Breugel, Jonathan Crabbé, Rob Davis, Mihaela van der Schaar

TL;DR

Through extensive experiments, this work finds that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable generates out-of-distribution datasets better with fewer samples.

Abstract

Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different tabular datasets, tabular metadata (e.g. dataset description and feature headers), and tables lacking prior knowledge (e.g. feature order). In this work we propose LaTable: a novel tabular diffusion model that addresses these challenges and can be trained across different datasets. Through extensive experiments we find that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable can generate out-of-distribution datasets better with fewer samples. On the other hand, we explore the poor zero-shot performance of LaTable, and what it may teach us about building generative tabular foundation models with better zero- and few-shot generation capabilities.

Paper Structure

This paper contains 23 sections, 2 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: The diffusion pipeline, for both training and generation. Note that the whole pipeline is flexible with respect to the number of numerical and categorical features in the input. The LLM encoder is frozen, and the transformer is an encoder-only model without positional encodings or causal masking. Additional conditioning (e.g. a missingness mask or other conditioning information) can be trivially added to the transformer input, or through cross-attention layers; a minimal architecture sketch is given after this figure list.
  • Figure 2: LaTable outperforms baselines especially for smaller datasets. Generation results for different generative models across 78 datasets (sorted by size), with 5 seeds per (model, dataset) and a LOWESS curve fitted for smoothing.
  • Figure 3: A finetuned LaTable outperforms baselines significantly on out-of-distribution datasets with few samples. Generation performance on out-of-distribution datasets as a function of the training dataset size. Metrics are averaged over 5 datasets in $\mathcal{D}_{\mathrm{ood}}$, where each dataset is reduced to $n_{\mathrm{samples}}$ samples. Mean and error bars (95%) are computed over 5 seeds.
  • Figure 4: Number of samples per dataset plotted against number of features, with in-distribution datasets $\mathcal{D}_{\mathrm{id}}$ in blue and $\mathcal{D}_{\mathrm{ood}}$ in orange. Note: double log-scale.
  • Figure 5: Features in $\mathcal{D}_\mathrm{id}$ do not cover all features in $\mathcal{D}_\mathrm{ood}$. Cosine similarity matrix between feature embeddings in $\mathcal{D}_\mathrm{ood}$ and $\mathcal{D}_\mathrm{id}$. For 75.2% of the ood-features, all id-features are dissimilar (similarity $< 0.8$). Note: the three short diagonal lines correspond to the dataset "Kc2 Software Defects" (OpenML ID 1063) and 3 similar datasets that contain the same features; a short sketch of this similarity check follows the figure list.
  • ...and 3 more figures
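
As a concrete illustration of the Figure 1 architecture, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation): each table cell becomes a token built from a frozen text embedding of its column header plus a value embedding, and an encoder-only Transformer without positional encodings or causal masking predicts the diffusion noise for the numerical cells. All names and dimensions (LaTableLikeDenoiser, d_model, the 384-dimensional header embeddings, the timestep handling) are assumptions for illustration; categorical denoising and the generation loop are omitted.

```python
import torch
import torch.nn as nn


class LaTableLikeDenoiser(nn.Module):
    """Illustrative stand-in for the Figure 1 denoiser (not the released model)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 4,
                 max_categories: int = 64, d_text: int = 384, n_timesteps: int = 1000):
        super().__init__()
        # The frozen LLM encoder is stubbed: we assume precomputed header embeddings
        # of dimension d_text and only learn a projection into the model width.
        self.header_proj = nn.Linear(d_text, d_model)
        self.num_value_proj = nn.Linear(1, d_model)                 # numerical cell values
        self.cat_value_emb = nn.Embedding(max_categories, d_model)  # categorical cell values
        self.time_emb = nn.Embedding(n_timesteps, d_model)          # diffusion timestep

        # Encoder-only transformer: no causal mask, no positional encodings,
        # so the model is permutation-equivariant over features (no fixed feature order).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_num = nn.Linear(d_model, 1)                        # noise prediction per numerical cell

    def forward(self, num_x, cat_x, header_emb, t):
        # num_x: [B, F_num] noisy numerical values
        # cat_x: [B, F_cat] categorical indices (latent handling omitted for brevity)
        # header_emb: [F_num + F_cat, d_text] frozen text embeddings of the column headers
        # t: [B] diffusion timesteps
        B = num_x.shape[0]
        headers = self.header_proj(header_emb).unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([self.num_value_proj(num_x.unsqueeze(-1)),
                            self.cat_value_emb(cat_x)], dim=1) + headers
        # Extra conditioning (here only the timestep) is simply added to the tokens;
        # the caption notes it could equally enter through cross-attention layers.
        tokens = tokens + self.time_emb(t).unsqueeze(1)
        h = self.encoder(tokens)                                    # bidirectional attention
        return self.out_num(h[:, :num_x.shape[1]]).squeeze(-1)      # [B, F_num]
```

Because the token mixer uses neither positional encodings nor a causal mask, its output does not depend on the feature order, matching the paper's observation that tables lack a natural feature ordering; flexibility in the number of features follows from the sequence length being set by the table's columns.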
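
The feature-overlap statistic quoted in the Figure 5 caption can likewise be sketched in a few lines. The encoder choice ("all-MiniLM-L6-v2") and the helper name are assumptions rather than the paper's exact setup; the idea is simply to embed the column headers, build the cosine-similarity matrix, and count the ood-features whose best id-match stays below 0.8.

```python
import numpy as np
from sentence_transformers import SentenceTransformer


def fraction_ood_features_without_id_match(ood_names, id_names, threshold=0.8):
    # Embed feature headers with an off-the-shelf sentence encoder (assumed choice).
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    ood = encoder.encode(ood_names, normalize_embeddings=True)  # [n_ood, d], unit-norm rows
    idf = encoder.encode(id_names, normalize_embeddings=True)   # [n_id, d], unit-norm rows
    sim = ood @ idf.T                                           # cosine-similarity matrix
    # Fraction of ood-features for which *every* id-feature is dissimilar.
    return float(np.mean(sim.max(axis=1) < threshold))
```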