
Engineering Regression Without Real-Data Training: Domain Adaptation for Tabular Foundation Models Using Multi-Dataset Embeddings

Lyle Regenwetter, Rosen Yu, Cyril Picard, Faez Ahmed

TL;DR

The results indicate that principled synthetic data curation can convert procedural generators into domain-relevant "data engines," enabling foundation models to improve in data-sparse scientific and industrial domains where real data collection is the primary bottleneck.

Abstract

Predictive modeling in engineering applications has long been dominated by bespoke models and small, siloed tabular datasets, limiting the applicability of large-scale learning approaches. Despite recent progress in tabular foundation models, the resulting synthetic training distributions used for pre-training may not reflect the statistical structure of engineering data, limiting transfer to engineering regression. We introduce TREDBench, a curated collection of 83 real-world tabular regression datasets with expert engineering/non-engineering labels, and use TabPFN 2.5's dataset-level embedding to study domain structure in a common representation space. We find that engineering datasets are partially distinguishable from non-engineering datasets, while standard procedurally generated datasets are highly distinguishable from engineering datasets, revealing a substantial synthetic-real domain gap. To bridge this gap without training on real engineering samples, we propose an embedding-guided synthetic data curation method: we generate and identify "engineering-like" synthetic datasets, and perform continued pre-training of TabPFN 2.5 using only the selected synthetic tasks. Across 35 engineering regression datasets, this synthetic-only adaptation improves predictive accuracy and data efficiency, outperforming TabPFN 2.5 on 29/35 datasets and AutoGluon on 27/35, with mean multiplicative data-efficiency gains of 1.75x and 4.44x, respectively. More broadly, our results indicate that principled synthetic data curation can convert procedural generators into domain-relevant "data engines," enabling foundation models to improve in data-sparse scientific and industrial domains where real data collection is the primary bottleneck.
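As a rough illustration of the embedding-guided curation idea described above, the following sketch ranks synthetic datasets by how close their dataset-level embedding lies to any engineering dataset embedding and keeps the closest fraction. Everything here is an assumption for illustration: the abstract does not state the selection criterion, so nearest-neighbor Euclidean distance is a stand-in; the function name `select_engineering_like` and the parameter `keep_fraction` are hypothetical; and random 192-dimensional vectors replace real TabPFN 2.5 embeddings.

```python
import numpy as np

def select_engineering_like(synth_emb, eng_emb, keep_fraction=0.1):
    """Return indices of the synthetic datasets nearest to any engineering dataset.

    Assumed criterion (not taken from the paper): squared Euclidean distance
    to the nearest engineering embedding; smaller means more engineering-like.
    """
    # Pairwise differences, shape (n_synth, n_eng, dim)
    diffs = synth_emb[:, None, :] - eng_emb[None, :, :]
    # Squared distances, shape (n_synth, n_eng)
    d2 = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Distance from each synthetic dataset to its closest engineering dataset
    nearest = d2.min(axis=1)
    n_keep = max(1, int(round(keep_fraction * len(synth_emb))))
    return np.argsort(nearest)[:n_keep]

# Placeholder embeddings: random vectors standing in for real dataset embeddings.
rng = np.random.default_rng(0)
eng_emb = rng.normal(loc=1.0, scale=0.5, size=(35, 192))      # "engineering"
synth_near = rng.normal(loc=1.0, scale=0.5, size=(50, 192))   # engineering-like
synth_far = rng.normal(loc=-1.0, scale=2.0, size=(450, 192))  # noise-like
synth_emb = np.vstack([synth_near, synth_far])

selected = select_engineering_like(synth_emb, eng_emb, keep_fraction=0.1)
print(len(selected), "synthetic datasets kept for continued pre-training")
```

In this toy setup the kept indices correspond to the synthetic datasets drawn near the engineering cluster. In practice the selected tasks, not the real engineering data, would be passed on to continued pre-training.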

Paper Structure (32 sections, 2 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: 2D t-SNE projection of TabPFN 2.5's 192-dimensional dataset embedding. Distributional differences between engineering and non-engineering data are apparent. A small proportion of the procedurally generated data overlaps with the engineering data. The procedurally generated data appears to splay off toward random noise, indicating that a significant amount of the procedurally generated training data may be too random to be realistic.
  • Figure 2: Confusion matrix illustrating the distinguishability of dataset classes (according to TabPFN 2.5's embedding). Distributional differences between engineering and non-engineering datasets are visible. A small portion of the procedurally generated data appears to span the engineering data.
  • Figure 3: Visualization of procedurally generated data that is subselected to maximally resemble engineering data. The plot is a 2D t-SNE projection of TabPFN 2.5's 192-dimensional dataset embedding. The selected procedurally generated data appears to span the engineering data better than the randomly generated data.
  • Figure 4: Confusion matrix illustrating that subselected procedurally generated data and engineering data are less easily distinguished (based on TabPFN 2.5's embedding).
  • Figure 5: Sample comparison of model accuracy for different quantities of data over 12 example problems.
  • ...and 2 more figures
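
The confusion-matrix figures above measure how well dataset classes can be told apart in embedding space. A minimal sketch of that kind of analysis follows, under stated assumptions: the classifier used in the paper is not given here, so a nearest-centroid rule is substituted; the class coding (0 = engineering, 1 = non-engineering, 2 = procedurally generated), the function name `nearest_centroid_confusion`, and the random placeholder embeddings are all hypothetical.

```python
import numpy as np

def nearest_centroid_confusion(train_X, train_y, test_X, test_y, n_classes):
    """Fit class centroids on the training split and return a confusion matrix
    (rows: true class, columns: predicted class) on the test split."""
    centroids = np.stack(
        [train_X[train_y == c].mean(axis=0) for c in range(n_classes)]
    )
    # Squared distance from every test embedding to every centroid
    d2 = ((test_X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = d2.argmin(axis=1)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (test_y, pred), 1)  # accumulate counts per (true, pred) pair
    return cm

# Placeholder 192-dimensional embeddings (random, not real TabPFN outputs).
rng = np.random.default_rng(1)
eng = rng.normal(0.00, 1.0, size=(40, 192))      # class 0
non_eng = rng.normal(0.05, 1.0, size=(40, 192))  # class 1, overlaps class 0
synth = rng.normal(3.00, 1.0, size=(40, 192))    # class 2, far from both

X = np.vstack([eng, non_eng, synth])
y = np.repeat(np.arange(3), 40)

# Simple even/odd hold-out split
train, test = np.arange(0, 120, 2), np.arange(1, 120, 2)
cm = nearest_centroid_confusion(X[train], y[train], X[test], y[test], n_classes=3)
print(cm)
```

Large off-diagonal counts between two classes indicate that they are hard to distinguish in the embedding, which is the property the curated synthetic data is meant to have relative to engineering data.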