Benchmarking Tabular Foundation Models for Conditional Density Estimation in Regression

Rafael Izbicki, Pedro L. C. Rodrigues

Abstract

Conditional density estimation (CDE) - recovering the full conditional distribution of a response given tabular covariates - is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models, such as TabPFN and TabICL, naturally produce predictive distributions, but their effectiveness as general-purpose CDE methods has not been systematically evaluated, unlike their performance for point prediction, which is well studied. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN exposed to 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.
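To make the evaluation metrics named above concrete, here is a minimal sketch of two of them: the empirical CDE loss of Izbicki & Lee (2016), $\frac{1}{n}\sum_i \int \hat p(y\mid x_i)^2\,dy - \frac{2}{n}\sum_i \hat p(y_i\mid x_i)$, and the closed-form CRPS of a Gaussian predictive distribution (Gneiting & Raftery, 2007). The grid-based density representation is an illustrative assumption, not the paper's implementation.

```python
import numpy as np
from math import erf, pi, sqrt

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian predictive N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = np.exp(-0.5 * z * z) / sqrt(2 * pi)
    cdf = 0.5 * (1 + erf(z / sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / sqrt(pi))

def cde_loss(dens, y_grid, dens_at_obs):
    """Empirical CDE loss on a fixed grid (illustrative, not the paper's code).

    dens        : (n, G) array, estimated density phat(y | x_i) on y_grid
    y_grid      : (G,) equally spaced evaluation grid
    dens_at_obs : (n,) array, phat(y_i | x_i) at the observed responses
    """
    dy = y_grid[1] - y_grid[0]
    sq_term = np.mean(np.sum(dens ** 2, axis=1) * dy)  # (1/n) sum_i int phat^2 dy
    return sq_term - 2.0 * np.mean(dens_at_obs)        # lower is better

# Sanity check: a standard-normal estimate evaluated at y = 0.
grid = np.linspace(-6, 6, 1201)
phi = np.exp(-0.5 * grid ** 2) / sqrt(2 * pi)
loss = cde_loss(phi[None, :], grid, np.array([1 / sqrt(2 * pi)]))
```

A perfectly centred Gaussian forecast attains `gaussian_crps(0, 0, 1) ≈ 0.234`; both scores reward densities that are simultaneously sharp and well located, which is why they appear alongside calibration metrics in the benchmark.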

Paper Structure

This paper contains 41 sections, 4 equations, 49 figures, 2 tables.

Figures (49)

  • Figure 1: Illustration of CDE on a bimodal synthetic DGP. Each row corresponds to a different test instance whose true conditional density (dashed black) is a two-component Gaussian mixture with covariate-dependent means, variances, and weights. Columns show three training sizes ($n\in\{50,200,2{,}000\}$). TabPFN-2.5 (orange) already captures the bimodal structure at $n=200$, while Flow-Spline (blue) and FlexCode-RF (red) require considerably more data and still show spurious peaks or roughness. Note that this is a controlled synthetic example designed to illustrate differences among methods; see Section \ref{sec:results} for real data analyses.
  • Figure 2: CDE loss across real-world datasets at $n = 50$. Per-dataset raw CDE loss values (lower/greener is better). Datasets are sorted by covariate dimension $d$. A $*$ marks foundation models that significantly outperform all parametric and nonparametric competitors on that dataset. The two TabPFN variants (orange) achieve the top two average ranks (bottom row), while TabICL-Quantiles (7.0) is narrowly outranked by Student-t-Ridge (6.6). Even at this extremely small sample size, a foundation model achieves the best CDE loss on 82% of datasets.
  • Figure 3: CDE loss across real-world datasets at $n = 1{,}000$. Per-dataset raw CDE loss values (lower/greener is better). Datasets are sorted by $d$. A $*$ marks foundation models that significantly outperform all competitors on that dataset. All foundation models occupy the top three average ranks. Foundation models achieve the best CDE loss on 92% of datasets.
  • Figure 4: CDE loss across real-world datasets at $n = 20{,}000$. Per-dataset raw CDE loss values (lower/greener is better). Datasets are sorted by $d$. A $*$ marks foundation models that significantly outperform all competitors on that dataset; $\times$ marks foundation models that encountered out-of-memory errors. All three foundation models occupy the top three average ranks. On CTSlices, both TabPFN variants ran out of memory at this sample size.
  • Figure 5: CDE loss vs. sample size for selected real-world datasets. Each panel shows one dataset, arranged by increasing $d$. Foundation models (orange, bold) are contrasted against parametric (green, faded) and nonparametric (blue, faded) baselines; the $y$-axis is oriented so that higher values are better. Foundation models are already competitive at $n=50$ and often exhibit the steepest improvement with growing $n$. Parametric baselines plateau early and rarely reach the top regardless of sample size. Nonparametric methods show greater variability, occasionally matching foundation models in higher-dimensional settings but without consistency.
  • ...and 44 more figures
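Figure 1's synthetic setup can be sketched as follows: a two-component Gaussian mixture whose means, variances, and mixing weight all drift with the covariate. The specific parameterization below is a hypothetical stand-in (the paper's exact DGP is not reproduced here); it simply exhibits the covariate-dependent bimodality the figure describes.

```python
import numpy as np

def sample_bimodal_dgp(n, rng):
    """Draw (x, y) pairs from an illustrative bimodal mixture DGP.
    All mixture parameters below are hypothetical, chosen for visible bimodality."""
    x = rng.uniform(-1.0, 1.0, size=n)
    w = 1.0 / (1.0 + np.exp(-3.0 * x))        # mixing weight depends on x
    mu1, mu2 = -1.0 + x, 1.0 + 0.5 * x        # component means depend on x
    s1, s2 = 0.3 + 0.1 * np.abs(x), 0.2       # one variance depends on x
    pick = rng.random(n) < w
    y = np.where(pick, rng.normal(mu1, s1), rng.normal(mu2, s2))
    return x, y

def true_density(y_grid, x):
    """True conditional density p(y | x) of the sketch DGP above."""
    w = 1.0 / (1.0 + np.exp(-3.0 * x))
    mu1, mu2 = -1.0 + x, 1.0 + 0.5 * x
    s1, s2 = 0.3 + 0.1 * abs(x), 0.2
    norm = lambda y, m, s: np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return w * norm(y_grid, mu1, s1) + (1 - w) * norm(y_grid, mu2, s2)

rng = np.random.default_rng(0)
x, y = sample_bimodal_dgp(2000, rng)
grid = np.linspace(-3.0, 3.0, 601)
dens = true_density(grid, 0.0)  # at x = 0, two well-separated modes near -1 and +1
```

Because the true conditional density is available in closed form, a setup like this lets one plot estimated densities against the ground truth at any test point, exactly as the figure's three columns do across training sizes.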