Improving TabPFN's Synthetic Data Generation by Integrating Causal Structure

Davide Tugnoli, Andrea De Lorenzo, Marco Virgolin, Giovanni Cinà

TL;DR

Results indicate that injecting causal structure into autoregressive generation enhances the reliability of synthetic tabular data.

Abstract

Synthetic tabular data generation addresses data scarcity and privacy constraints in a variety of domains. Tabular Prior-Data Fitted Network (TabPFN), a recent foundation model for tabular data, has been shown capable of generating high-quality synthetic tabular data. However, TabPFN is autoregressive: features are generated sequentially by conditioning on the previous ones, depending on the order in which they appear in the input data. We demonstrate that when the feature order conflicts with causal structure, the model produces spurious correlations that impair its ability to generate synthetic data and preserve causal effects. We address this limitation by integrating causal structure into TabPFN's generation process through two complementary approaches: Directed Acyclic Graph (DAG)-aware conditioning, which samples each variable given its causal parents, and a Completed Partially Directed Acyclic Graph (CPDAG)-based strategy for scenarios with partial causal knowledge. We evaluate these approaches on controlled benchmarks and six CSuite datasets, assessing structural fidelity, distributional alignment, privacy preservation, and Average Treatment Effect (ATE) preservation. Across most settings, DAG-aware conditioning improves the quality and stability of synthetic data relative to vanilla TabPFN. The CPDAG-based strategy shows moderate improvements, with effectiveness depending on the number of oriented edges. These results indicate that injecting causal structure into autoregressive generation enhances the reliability of synthetic tabular data.
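The contrast between order-dependent autoregressive conditioning and DAG-aware conditioning can be illustrated with a minimal sketch. It does not call TabPFN or reproduce the authors' implementation; it only lists which columns each variable would be conditioned on under the two schemes. The helper names (`autoregressive_plan`, `dag_aware_plan`), the column names, and the example DAG are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def autoregressive_plan(columns):
    """Order-dependent scheme: each column is conditioned on every
    column that precedes it in the table (illustrative only)."""
    return [(col, list(columns[:i])) for i, col in enumerate(columns)]


def dag_aware_plan(parents):
    """DAG-aware scheme: visit columns in a topological order and
    condition each one only on its causal parents (illustrative only).

    `parents` maps each column to the list of its direct causes.
    """
    order = TopologicalSorter(parents).static_order()
    return [(col, sorted(parents.get(col, []))) for col in order]


# Hypothetical example: age -> treatment -> outcome, and age -> outcome.
# The table happens to store the columns in an anti-causal order.
columns = ["outcome", "treatment", "age"]
parents = {
    "age": [],
    "treatment": ["age"],
    "outcome": ["treatment", "age"],
}

for col, given in autoregressive_plan(columns):
    print(f"table order: {col} | {given}")

for col, given in dag_aware_plan(parents):
    print(f"DAG-aware:   {col} | {given}")
```

In the table-order plan, `age` would be generated last and conditioned on its own effects, which is the kind of ordering conflict the abstract describes. In the DAG-aware plan, every conditioning set contains only direct causes.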

Paper Structure (37 sections, 5 equations, 27 figures, 5 tables)

Figures (27)

  • Figure 1: Hodges-Lehmann estimates comparing vanilla TabPFN with original ordering versus topological ordering in . Positive values indicate that topological ordering achieves lower metric values (i.e., better synthetic data quality). Filled markers with solid error bars indicate significance at $p<0.05$ (Holm correction).
  • Figure 2: Hodges-Lehmann estimates comparing vanilla TabPFN with original ordering versus DAG-aware generation (left) and versus minimal CPDAG (right) in . Positive values indicate that the respective method achieves lower metric values (i.e., better synthetic data quality). Filled markers with solid error bars indicate significance at $p<0.05$ (Holm correction).
  • Figure 3: Hodges-Lehmann estimates of the reduction in absolute ATE error ($\Delta_{\text{ATE}}$) when comparing vanilla TabPFN with original ordering versus topological ordering on CSuite and datasets. Positive values indicate that topological ordering produces smaller errors (closer to ground truth), while negative values indicate larger errors. Filled markers with solid error bars indicate significance at $p<0.05$ (Holm correction).
  • Figure 4: Hodges-Lehmann estimates of the reduction in absolute ATE error ($\Delta_{\text{ATE}}$) when comparing vanilla TabPFN with original ordering versus DAG-aware generation (left) and versus minimal CPDAG (right) on ATE preservation. Positive values indicate that the respective method (DAG-aware or CPDAG-based) produces smaller errors (closer to ground truth), while negative values indicate larger errors. Filled markers with solid error bars indicate significance at $p<0.05$ (Holm correction).
  • Figure A1: Order sensitivity for vanilla TabPFN on . Y-axis shows the range of (left), ($k=2$, center), and (right) values across three feature orderings: original, topological, and reverse topological. Shaded regions show 95% bootstrap confidence intervals.
  • ...and 22 more figures